Submitted by:
| # | Name | ID | Email |
|---|---|---|---|
| Student 1 | Nitzan Madar | 203483334 | NitzanMadar@Campus.Technion.ac.il |
| Student 2 | Guy Bar-Shalom | 313537896 | Guy.B@Campus.Technion.ac.il |
| Student 3 | Hod Paska | 204441810 | Hodp@Campus.Technion.ac.il |
In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game.
You can of course use any editor or IDE to work on these files.

In the tutorial we have seen value-based reinforcement learning, in which we learn to approximate the action-value function $q(s,a)$.
In this exercise we'll explore a different approach, directly learning the agent's policy distribution, $\pi(a|s)$ by using policy gradients, in order to safely land on the moon!
%load_ext autoreload
%autoreload 2
%matplotlib inline
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Prefer CPU, GPU won't help much in this assignment
device = 'cpu'
print('Using device:', device)
# Seed for deterministic tests
SEED = 42
Using device: cpu
Some technical notes before we begin:
- This exercise needs to render the game environment, which requires a (possibly virtual) display. When running on the course server, wrap your commands with the xvfb-run command to create a virtual screen. For example, with srun do
  srun -c2 --gres=gpu:1 xvfb-run -a -s "-screen 0 1440x900x24" python main.py run-nb <filename>
  or
  srun -c2 xvfb-run -a -s "-screen 0 1440x900x24" python main.py prepare-submission ...
  and so on.
- We have added the xvfb-run command inside the jupyter-lab.sh script, so you can use it as usual with srun.
- The gym library is not officially supported on Windows. It should nevertheless be possible to install and run the necessary environment for this exercise, but we cannot provide you with technical support for this. If you have trouble installing locally, we suggest running on the course server.

Recall from the tutorial that we define the policy of an agent as the conditional distribution, $$ \pi(a|s) = \Pr(a_t=a\vert s_t=s), $$ which defines how likely the agent is to take action $a$ at state $s$.
Furthermore we define the action-value function, $$ q_{\pi}(s,a) = \E{g_t(\tau)|s_t = s,a_t=a,\pi} $$ where $$ g_t(\tau) = r_{t+1}+\gamma r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ is the total discounted reward of a specific trajectory $\tau$ from time $t$, and the expectation in $q$ is over all possible trajectories, $ \tau=\left\{ (s_0,a_0,r_1,s_1), \dots (s_T,a_T,r_{T+1},s_{T+1}) \right\}. $
In the tutorial we saw that we can learn a value function by starting with some random function and updating it iteratively using the Bellman optimality equation. Given some action-value function, we can immediately derive a policy from it by simply selecting an action which maximizes the action-value at the current state, i.e. $$ \pi(a|s) = \begin{cases} 1, & a = \arg\max_{a'\in\cset{A}} q(s,a') \\ 0, & \text{else} \end{cases}. $$ This is called $q$-learning. This approach aims to obtain a policy indirectly through the action-value function. Yet, in most cases we don't actually care about knowing the value of particular states, since all we need is a good policy for our agent.
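To make this concrete, here is a minimal toy sketch (not part of the assignment code; the function name and q-table are ours) of extracting the greedy policy from a tabular action-value function:

```python
import numpy as np

def greedy_policy(q_table: np.ndarray) -> np.ndarray:
    # Given a tabular q(s,a) of shape (n_states, n_actions), return the
    # deterministic policy pi(a|s) as one-hot rows selecting argmax_a q(s,a).
    pi = np.zeros_like(q_table)
    pi[np.arange(q_table.shape[0]), q_table.argmax(axis=1)] = 1.0
    return pi

q = np.array([[1.0, 3.0],
              [2.0, 0.5]])
print(greedy_policy(q))  # [[0. 1.]
                         #  [1. 0.]]
```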
Here we'll take a different approach and learn a policy distribution $\pi(a|s)$ directly - by using policy gradients.
We define a parametric policy, $\pi_\vec{\theta}(a|s)$, and maximize total discounted reward (or minimize the negative reward): $$ \mathcal{L}(\vec{\theta})=\E[\tau]{-g(\tau)|\pi_\vec{\theta}} = -\int g(\tau)p(\tau|\vec{\theta})d\tau, $$ where $p(\tau|\vec{\theta})$ is the probability of a specific trajectory $\tau$ under the policy defined by $\vec{\theta}$.
Since we want to find the parameters $\vec{\theta}$ which minimize $\mathcal{L}(\vec{\theta})$, we'll compute the gradient w.r.t. $\vec{\theta}$: $$ \grad\mathcal{L}(\vec{\theta}) = -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau. $$
Unfortunately, if we try to write $p(\tau|\vec{\theta})$ explicitly, we find that computing its gradient with respect to $\vec{\theta}$ is quite intractable due to a huge product of terms depending on $\vec{\theta}$: $$ p(\tau|\vec{\theta})=p\left(\left\{ (s_t,a_t,r_{t+1},s_{t+1})\right\}_{t\geq0}\given\vec{\theta}\right) =p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t). $$
However, by using the fact that $\grad_{x}\log(f(x))=\frac{\grad_{x}f(x)}{f(x)}$, we can convert the product into a sum: $$ \begin{align} \grad\mathcal{L}(\vec{\theta}) &= -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau = -\int g(\tau)\frac{\grad p(\tau|\vec{\theta})}{p(\tau|\vec{\theta})}p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left( p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\left( \log p(s_0) + \sum_{t\geq0} \log \pi_{\vec{\theta}}(a_t|s_t) + \sum_{t\geq0}\log p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t) p(\tau|\vec{\theta})d\tau \\ &= \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. \end{align} $$
This is the "vanilla" version of the policy gradient. We can interpret it as a weighted log-likelihood function: the log-policy is the log-likelihood term we wish to maximize, and the total discounted reward acts as a weight. High-return positive trajectories will cause the probability of the actions taken during them to increase, while negative-return trajectories will cause those probabilities to decrease.
In the following figures we see three trajectories: high-return positive-reward (green), low-return positive-reward (yellow) and negative-return (red) and the action probabilities along the trajectories after the update. Credit: Sergey Levine.
The major drawback of the policy gradient is its high variance, which causes erratic optimization behavior and therefore slow convergence. One reason for this is that the log-policy weight term $g(\tau)$ can vary wildly between different trajectories, even if they're similar in actions. Later on we'll implement the loss and explore some methods of variance reduction.
In the spirit of the recent achievements of the Israeli space industry, we'll apply our reinforcement learning skills to solve a simple game called LunarLander.
This game is available as an environment in OpenAI gym.
In this environment, you need to control the lander and get it to land safely on the moon. To do so, you can fire the bottom, right or left thrusters (each is either fully on or fully off), and you must land within the designated zone as quickly as possible and with minimal wasted fuel.
import gym
# Just for fun :) ... but also to re-define the default max number of steps
ENV_NAME = 'Beresheet-v2'
MAX_EPISODE_STEPS = 300
if ENV_NAME not in gym.envs.registry.env_specs:
gym.register(
id=ENV_NAME,
entry_point='gym.envs.box2d:LunarLander',
max_episode_steps=MAX_EPISODE_STEPS,
reward_threshold=200,
)
import gym
env = gym.make(ENV_NAME)
print(env)
print(f'observations space: {env.observation_space}')
print(f'action space: {env.action_space}')
ENV_N_ACTIONS = env.action_space.n
ENV_N_OBSERVATIONS = env.observation_space.shape[0]
<TimeLimit<LunarLander<Beresheet-v2>>> observations space: Box(-inf, inf, (8,), float32) action space: Discrete(4)
The observation at each step contains the lander's position, velocity, angle, angular velocity and ground-contact state. The actions are: no-op, fire left thruster, fire bottom thruster, and fire right thruster.
You are highly encouraged to read the documentation in the source code of the LunarLander environment to understand the reward system,
and see how the actions and observations are created.
Let's start with our policy-model. This will be a simple neural net, which should take an observation and return a score for each possible action.
TODO:
- Implement the PolicyNet class in the hw4/rl_pg.py module. Start small: a simple MLP with a few hidden layers is a good starting point. You can come back and change it later based on the experiments.
- Implement the build_for_env method to instantiate a PolicyNet based on the configuration of a given environment.
- Tweak the hyperparameters returned by part1_pg_hyperparams() in hw4/answers.py.

import hw4.rl_pg as hw4pg
import hw4.answers
hp = hw4.answers.part1_pg_hyperparams()
# You can add keyword-args to this function which will be populated from the
# hyperparameters dict.
p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
p_net
PolicyNet(
(model): Sequential(
(0): Linear(in_features=8, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=256, bias=True)
(3): ReLU()
(4): Linear(in_features=256, out_features=4, bias=True)
)
)
Now we need an agent. The purpose of our agent will be to act according to the current policy and generate experiences.
Our PolicyAgent will use a PolicyNet as the current policy function.
We'll also define some extra datatypes to help us represent the data generated by our agent.
You can find the Experience, Episode and TrainBatch datatypes in the hw4/rl_data.py module.
TODO: Implement the current_action_distribution() method of the PolicyAgent class in the hw4/rl_pg.py module.
for i in range(10):
    agent = hw4pg.PolicyAgent(env, p_net, device)
    d = agent.current_action_distribution()
    test.assertSequenceEqual(d.shape, (env.action_space.n,))
    test.assertAlmostEqual(d.sum(), 1.0, delta=1e-5)
print(d)
tensor([0.2600, 0.2317, 0.2640, 0.2443])
TODO: Implement the step() method of the PolicyAgent.
agent = hw4pg.PolicyAgent(env, p_net, device)
exp = agent.step()
test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-9.2669e-04, 1.4198e+00, -9.3879e-02, 3.9301e-01, 1.0806e-03,
2.1265e-02, 0.0000e+00, 0.0000e+00]), action=2, reward=-2.3340698261237547, is_done=False)
To test our agent, we'll write some code that allows it to play an environment. We'll use the Monitor
wrapper in gym to generate a video of the episode for visual debugging.
TODO: Complete the implementation of the monitor_episode() method of the PolicyAgent.
env, n_steps, reward = agent.monitor_episode(ENV_NAME, p_net, device=device)
To display the Monitor video in this notebook, we'll use a helper function from our jupyter_utils and a small wrapper that extracts the path of the last video file.
import cs236781.jupyter_utils as jupyter_utils
def show_monitor_video(monitor_env, idx=0, **kw):
# Extract video path
video_path = monitor_env.videos[idx][0]
video_path = os.path.relpath(video_path, start=os.path.curdir)
# Use helper function to embed the video
return jupyter_utils.show_video_in_notebook(video_path, **kw)
print(f'Episode ran for {n_steps} steps. Total reward: {reward:.2f}')
show_monitor_video(env)
Episode ran for 134 steps. Total reward: -110.33
The next step is to create data to train on. We need to train on batches of state-action pairs, so that our network can learn to predict the actions.
We'll split this task into three parts:
1. Generating Episodes, by using an Agent that's playing according to our current policy network. Each Episode object contains the Experience objects created by the agent.
2. Calculating the estimated q-values for each state in an Episode.
3. Combining multiple Episodes into a batch of tensors to train on. Each batch will contain the states, the action taken per state, the reward accrued, and the calculated estimated state-values. These will be stored in a TrainBatch object.

TODO: Complete the implementation of the episode_batch_generator() method in the TrainBatchDataset class within the hw4.rl_data module. This will address part 1 in the list above.
import hw4.rl_data as hw4data
def agent_fn():
env = gym.make(ENV_NAME)
hp = hw4.answers.part1_pg_hyperparams()
p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
return hw4pg.PolicyAgent(env, p_net, device)
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
batch_gen = ds.episode_batch_generator()
b = next(batch_gen)
print('First episode:', b[0])
test.assertEqual(len(b), 8)
for ep in b:
test.assertIsInstance(ep, hw4data.Episode)
# Check that it's a full episode
is_done = [exp.is_done for exp in ep.experiences]
test.assertFalse(any(is_done[0:-1]))
test.assertTrue(is_done[-1])
First episode: Episode(total_reward=-209.82, #experences=86)
TODO: Complete the implementation of the calc_qvals() method in the Episode class.
This will address part 2.
These q-values are an estimate of the actual action value function: $$\hat{q}_{t} = \sum_{t'\geq t} \gamma^{t'-t}r_{t'+1}.$$
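For intuition, this estimate satisfies the recursion $\hat{q}_t = r_{t+1} + \gamma\hat{q}_{t+1}$, so it can be computed with a single backward pass over an episode's rewards. A standalone sketch (the function name is ours; the actual calc_qvals signature may differ):

```python
def discounted_qvals(rewards, gamma):
    # rewards[t] holds r_{t+1}; scan backwards so that each q-value is
    # produced in O(1) via q_t = r_{t+1} + gamma * q_{t+1}.
    qvals = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        qvals[t] = running
    return qvals

print(discounted_qvals([1.0, 2.0, 3.0], 0.5))  # [2.75, 3.5, 3.0]
```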
np.random.seed(SEED)
test_rewards = np.random.randint(-10, 10, 100)
test_experiences = [hw4pg.Experience(None,None,r,False) for r in test_rewards]
test_episode = hw4data.Episode(np.sum(test_rewards), test_experiences)
qvals = test_episode.calc_qvals(0.9)
qvals = list(qvals)
expected_qvals = np.load(os.path.join('tests', 'assets', 'part1_expected_qvals.npy'))
for i in range(len(test_rewards)):
test.assertAlmostEqual(expected_qvals[i], qvals[i], delta=1e-3)
TODO: Complete the implementation of the from_episodes() method in the TrainBatch class.
This will address part 3.
Notes:
- The TrainBatchDataset class provides a generator function that will use the above function to lazily generate batches of training samples and labels on demand.
- We can use a PyTorch DataLoader to wrap our Dataset and get parallel data loading for free! This means we can run multiple environments with multiple agents in separate background processes to generate data for training, and thus prevent the data-loading bottleneck caused by the fact that we must generate full Episodes in order to calculate the q-values.
- We set the DataLoader's batch_size to None because we have already implemented custom batching in our dataset.
- You can control the number of worker processes with the num_workers parameter in the hyperparams dict. Set num_workers=0 to disable parallelization.

from torch.utils.data import DataLoader
hp = hw4.answers.part1_pg_hyperparams()
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
dl = DataLoader(
ds,
batch_size=None,
num_workers=hp['num_workers'],
multiprocessing_context='fork' if hp['num_workers'] > 0 else None
)
for i, train_batch in enumerate(dl):
states, actions, qvals, reward_mean = train_batch
print(f'#{i}: {train_batch}')
test.assertEqual(states.shape[0], actions.shape[0])
test.assertEqual(qvals.shape[0], actions.shape[0])
test.assertEqual(states.shape[1], env.observation_space.shape[0])
if i > 5:
break
#0: TrainBatch(states: torch.Size([699, 8]), actions: torch.Size([699]), q_vals: torch.Size([699])), num_episodes: 8) #1: TrainBatch(states: torch.Size([694, 8]), actions: torch.Size([694]), q_vals: torch.Size([694])), num_episodes: 8) #2: TrainBatch(states: torch.Size([716, 8]), actions: torch.Size([716]), q_vals: torch.Size([716])), num_episodes: 8) #3: TrainBatch(states: torch.Size([653, 8]), actions: torch.Size([653]), q_vals: torch.Size([653])), num_episodes: 8) #4: TrainBatch(states: torch.Size([660, 8]), actions: torch.Size([660]), q_vals: torch.Size([660])), num_episodes: 8) #5: TrainBatch(states: torch.Size([635, 8]), actions: torch.Size([635]), q_vals: torch.Size([635])), num_episodes: 8) #6: TrainBatch(states: torch.Size([755, 8]), actions: torch.Size([755]), q_vals: torch.Size([755])), num_episodes: 8)
As usual, we need a loss function to optimize over. We'll calculate three types of losses:
We have derived the policy-gradient as $$ \grad\mathcal{L}(\vec{\theta}) = \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$
By writing the discounted reward explicitly and enforcing causality, i.e. the action taken at time $t$ can't affect the reward at time $t'<t$, we can get a slightly lower-variance version of the policy gradient:
$$ \grad\mathcal{L}_{\text{PG}}(\vec{\theta}) = \E[\tau]{-\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'}r_{t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

In practice, the expectation over trajectories is calculated using a Monte-Carlo approach, i.e. by simply sampling $N$ trajectories and averaging the term inside the expectation. Therefore, we will use the following estimated version of the policy gradient:

$$ \begin{align} \hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta}) &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'}r_{i,t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}) \\ &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \hat{q}_{i,t} \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). \end{align} $$

Note the use of the notation $\hat{q}_{i,t}$ to represent the estimated action-value at time $t$ in the sampled trajectory $i$. Here $\hat{q}_{i,t}$ acts as the weight term for the policy gradient.
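As a rough illustration of how this estimator maps to code, here is a standalone sketch of the weighted negative log-likelihood (all names are ours; the homework's VanillaPolicyGradientLoss may normalize differently, e.g. per episode rather than per sample):

```python
import torch
import torch.nn.functional as F

def vanilla_pg_loss(action_scores, actions, qvals):
    # action_scores: (N, n_actions) raw scores from the policy network.
    # actions: (N,) actions actually taken; qvals: (N,) estimated q-hat weights.
    log_pi = F.log_softmax(action_scores, dim=1)              # log pi(a|s)
    log_pi_taken = log_pi.gather(1, actions.view(-1, 1)).squeeze(1)
    # Weighted negative log-likelihood: q-hat is a per-sample weight.
    return -(qvals * log_pi_taken).mean()

scores = torch.tensor([[2.0, 0.0], [0.0, 2.0]])
actions = torch.tensor([0, 1])
qvals = torch.tensor([1.0, -1.0])
print(vanilla_pg_loss(scores, actions, qvals))
```

With these symmetric toy inputs the positive-weight and negative-weight terms cancel, showing how opposite-sign returns push action probabilities in opposite directions.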
TODO: Complete the implementation of the VanillaPolicyGradientLoss class in the hw4/rl_pg.py module.
# Ensure deterministic run
env = gym.make(ENV_NAME)
env.seed(SEED)
torch.manual_seed(SEED)
def agent_fn():
# Use a simple "network" here, so that this test doesn't depend on
# your specific PolicyNet implementation
p_net_test = nn.Linear(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, bias=True)
agent = hw4pg.PolicyAgent(env, p_net_test)
return agent
dataloader = hw4data.TrainBatchDataset(agent_fn, gamma=0.9, episode_batch_size=4)
test_batch = next(iter(dataloader))
test_action_scores = torch.randn(len(test_batch), env.action_space.n)
print(f"{test_batch=}", end='\n\n')
print(f"test_action_scores=\n{test_action_scores}\nshape={test_action_scores.shape}", end='\n\n')
loss_fn_p = hw4pg.VanillaPolicyGradientLoss()
loss_p, _ = loss_fn_p(test_batch, test_action_scores)
print(f'{loss_p=}')
test.assertAlmostEqual(loss_p.item(), -36.642, delta=1e-2)
test_batch=TrainBatch(states: torch.Size([388, 8]), actions: torch.Size([388]), q_vals: torch.Size([388])), num_episodes: 4)
test_action_scores=
tensor([[-7.8158e-01, -1.5094e-01, 4.1993e-01, 1.4059e+00],
[ 1.3185e+00, 2.8492e-05, 1.4242e+00, -6.2004e-01],
[ 1.1881e+00, 2.0556e+00, 6.6027e-01, -1.1078e+00],
...,
[-3.4807e-01, -9.6609e-01, -6.5453e-01, -2.0421e+00],
[-1.1453e+00, 8.8890e-01, 2.4767e-01, 9.7610e-01],
[-1.2878e+00, 1.9194e+00, -9.3162e-01, -3.8786e-01]])
shape=torch.Size([388, 4])
loss_p=tensor(-36.6422)
Another way to reduce the variance of our gradient is to use relative weighting of the log-policy instead of absolute reward values. $$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$ In other words, we don't measure a trajectory's worth by its total reward, but by how much better that total reward is relative to some expected ("baseline") reward value, denoted above by $b$. Note that subtracting a baseline has no effect on the expected value of the policy gradient; it's easy to prove this directly from the definition.
Here we'll implement a very simple baseline (not optimal in terms of variance reduction): the average of the estimated state-values $\hat{q}_{i,t}$.
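A standalone sketch of this simple baseline (names and normalization are ours, not the assignment's class):

```python
import torch
import torch.nn.functional as F

def baseline_pg_loss(action_scores, actions, qvals):
    log_pi = F.log_softmax(action_scores, dim=1)
    log_pi_taken = log_pi.gather(1, actions.view(-1, 1)).squeeze(1)
    baseline = qvals.mean()                       # average q-hat over the batch
    # Weight each log-policy term by how much its q-hat exceeds the baseline.
    loss = -((qvals - baseline) * log_pi_taken).mean()
    return loss, baseline.item()

scores = torch.zeros(3, 2)                        # uniform policy scores
actions = torch.tensor([0, 1, 0])
qvals = torch.tensor([1.0, 2.0, 3.0])
loss, b = baseline_pg_loss(scores, actions, qvals)
print(loss, b)  # b == 2.0
```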
TODO: Complete the implementation of the BaselinePolicyGradientLoss class in the hw4/rl_pg.py module.
# Using the same batch and action_scores from above cell
loss_fn_p = hw4pg.BaselinePolicyGradientLoss()
loss_p, loss_dict = loss_fn_p(test_batch, test_action_scores)
print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['baseline'], -22.191, delta=1e-2)
test.assertAlmostEqual(loss_p.item(), -0.278, delta=1e-2)
loss_dict={'loss_p': -0.2786746621131897, 'baseline': -22.191144943237305}
The entropy of a probability distribution (in our case, the policy) is $$ H(\pi) = -\sum_{a} \pi(a|s)\log\pi(a|s). $$ The entropy is always non-negative and attains its maximum for a uniform distribution. We'll use the entropy of the policy as a bonus, i.e. we'll try to maximize it. The idea is to prevent the policy distribution from becoming too narrow and thus promote the agent's exploration.
First, we'll calculate the maximal possible entropy value of the action distribution for a set number of possible actions. This will be used as a normalization term.
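For a discrete distribution over $n$ actions the maximum entropy is $\log n$, attained by the uniform distribution. A standalone sketch of both computations (names are ours):

```python
import math
import torch

n_actions = 4
max_entropy = math.log(n_actions)   # entropy of the uniform distribution
print(max_entropy)                  # ~1.3863 for 4 actions

def mean_action_entropy(action_scores):
    # Average entropy of the softmax policy over a batch of action scores.
    log_pi = torch.log_softmax(action_scores, dim=1)
    return -(log_pi.exp() * log_pi).sum(dim=1).mean()

# Uniform scores attain the maximum; a sharply peaked policy is near zero.
print(mean_action_entropy(torch.zeros(5, n_actions)))              # ~log(4)
print(mean_action_entropy(torch.tensor([[10.0, 0.0, 0.0, 0.0]])))  # near 0
```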
TODO: Complete the implementation of the calc_max_entropy() method in the ActionEntropyLoss class.
loss_fn_e = hw4pg.ActionEntropyLoss(env.action_space.n)
print('max_entropy = ', loss_fn_e.max_entropy)
test.assertAlmostEqual(loss_fn_e.max_entropy, 1.38629436, delta=1e-3)
max_entropy = 1.3862943611198906
TODO: Complete the implementation of the forward() method in the ActionEntropyLoss class.
loss_e, _ = loss_fn_e(test_batch, test_action_scores)
print('loss = ', loss_e)
test.assertAlmostEqual(loss_e.item(), -0.8103, delta=1e-2)
loss = tensor(-0.8103)
We'll implement our training procedure as follows:
1. Generate a batch of episodes by playing the environment with an agent that acts according to the current policy network.
2. Calculate the estimated q-values $\hat{q}_{i,t}$ from the episode rewards.
3. Compute the policy-gradient loss from the network's action scores and the q-values.
4. Back-propagate and update the policy network's parameters.

This is known as the REINFORCE algorithm.

Fortunately, we've already implemented everything we need for steps 1-4, so we need only a bit more code to put it all together.
The following block implements a wrapper, train_pg to create all the objects we need in order to train our policy gradient model.
import hw4.answers
from functools import partial
ENV_NAME = "Beresheet-v2"
def agent_fn_train(agent_type, p_net, seed, envs_dict):
winfo = torch.utils.data.get_worker_info()
wid = winfo.id if winfo else 0
seed = seed + wid if seed else wid
env = gym.make(ENV_NAME)
envs_dict[wid] = env
env.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
return agent_type(env, p_net)
def train_rl(agent_type, net_type, loss_fns, hp, seed=None, checkpoints_file=None, **train_kw):
print(f'hyperparams: {hp}')
envs = {}
p_net = net_type(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, **hp)
p_net.share_memory()
agent_fn = partial(agent_fn_train, agent_type, p_net, seed, envs)
dataset = hw4data.TrainBatchDataset(agent_fn, hp['batch_size'], hp['gamma'])
dataloader = DataLoader(
dataset, batch_size=None, num_workers=hp['num_workers'],
multiprocessing_context='fork' if hp['num_workers'] > 0 else None
)
optimizer = optim.Adam(p_net.parameters(), lr=hp['learn_rate'], eps=hp['eps'])
trainer = hw4pg.PolicyTrainer(p_net, optimizer, loss_fns, dataloader, checkpoints_file)
try:
trainer.train(**train_kw)
except KeyboardInterrupt as e:
print('Training interrupted by user.')
finally:
for env in envs.values():
env.close()
# Include final model state
training_data = trainer.training_data
training_data['model_state'] = p_net.state_dict()
return training_data
def train_pg(baseline=False, entropy=False, **train_kwargs):
hp = hw4.answers.part1_pg_hyperparams()
loss_fns = []
if baseline:
loss_fns.append(hw4pg.BaselinePolicyGradientLoss())
else:
loss_fns.append(hw4pg.VanillaPolicyGradientLoss())
if entropy:
loss_fns.append(hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta']))
return train_rl(hw4pg.PolicyAgent, hw4pg.PolicyNet, loss_fns, hp, **train_kwargs)
The PolicyTrainer class implements the training loop, collects the losses and rewards and provides some useful checkpointing functionality.
The training loop will generate batches of episodes and train on them until either:
- the mean reward over the last running_mean_len episodes is greater than the target_reward, OR
- the number of episodes reaches max_episodes.

Most of this class is already implemented for you.
TODO:
- Implement the train_batch() method of the PolicyTrainer.
- Tweak the part1_pg_hyperparams() function within the hw4/answers.py module as needed. You get some sane defaults.

Let's check whether our model is actually training. We'll try to reach a very low (bad) target reward, just as a sanity check to see that training works. Your model should be able to reach this target reward within a few batches.
You can increase the target reward and use this block to manually tweak your model and hyperparameters a few times.
target_reward = -140 # VERY LOW target
train_data = train_pg(target_reward=target_reward, seed=SEED, max_episodes=2000, running_mean_len=10)
test.assertGreater(train_data['mean_reward'][-1], target_reward)
hyperparams: {'batch_size': 32, 'gamma': 0.98, 'beta': 0.5, 'learn_rate': 0.003, 'eps': 1e-08, 'num_workers': 0}
=== Training...
#7: step=00019752, loss_p=-60.57, m_reward(10)=-138.5 (best=-166.1): 13%|█▎ | 256/2000 [00:10<01:13, 23.79it/s]
=== 🚀 SOLVED - Target reward reached! 🚀
We'll now run a few experiments to see the effect of different loss functions on the training dynamics. Namely, we'll try:
- Vanilla policy gradient (vpg): no baseline, no entropy loss.
- Baseline policy gradient (bpg): baseline, no entropy loss.
- Entropy policy gradient (epg): no baseline, with entropy loss.
- Combined policy gradient (cpg): baseline, with entropy loss.

from collections import namedtuple
from pprint import pprint
import itertools as it
ExpConfig = namedtuple('ExpConfig', ('name','baseline','entropy'))
def exp_configs():
exp_names = ('vpg', 'epg', 'bpg', 'cpg')
z = zip(exp_names, it.product((False, True), (False, True)))
return (ExpConfig(n, b, e) for (n, (b, e)) in z)
pprint(list(exp_configs()))
[ExpConfig(name='vpg', baseline=False, entropy=False), ExpConfig(name='epg', baseline=False, entropy=True), ExpConfig(name='bpg', baseline=True, entropy=False), ExpConfig(name='cpg', baseline=True, entropy=True)]
We'll save the training data from each experiment for plotting.
import pickle
def dump_training_data(data, filename):
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, mode='wb') as file:
pickle.dump(data, file)
def load_training_data(filename):
with open(filename, mode='rb') as file:
return pickle.load(file)
Let's run the experiments! We'll run each configuration for a fixed number of episodes so that we can compare them.
Notes:
- If a results file from a previous run exists, the experiments are skipped and the saved results are loaded instead. To force a re-run, set force_run to True (careful: this will overwrite the old experiment results).

import math
exp_max_episodes = 4000
results = {}
training_data_filename = os.path.join('results', f'part1_exp.dat')
# Set to True to force re-run (careful! will delete old experiment results)
force_run = False
# Skip running if results file exists.
if os.path.isfile(training_data_filename) and not force_run:
print(f'=== results file {training_data_filename} exists, skipping experiments.')
results = load_training_data(training_data_filename)
else:
for n, b, e in exp_configs():
print(f'=== Experiment {n}')
results[n] = train_pg(baseline=b, entropy=e, max_episodes=exp_max_episodes, post_batch_fn=None)
dump_training_data(results, training_data_filename)
=== results file results/part1_exp.dat exists, skipping experiments.
def plot_experiment_results(results, fig=None):
if fig is None:
fig, _ = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(18,12))
for i, plot_type in enumerate(('loss_p', 'baseline', 'loss_e', 'mean_reward')):
ax = fig.axes[i]
for exp_name, exp_res in results.items():
if plot_type not in exp_res:
continue
ax.plot(exp_res['episode_num'], exp_res[plot_type], label=exp_name)
ax.set_title(plot_type)
ax.set_xlabel('episode')
ax.legend()
return fig
experiments_results_fig = plot_experiment_results(results)
You should see positive training dynamics in the graphs (reward going up). If you don't, use them to further update your model or hyperparams.
To pass the test, you'll need to get a best total mean reward of at least 10 within the fixed number of episodes using the combined loss. It's possible to get much higher (over 100).
best_cpg_mean_reward = max(results['cpg']['mean_reward'])
print(f'Best CPG mean reward: {best_cpg_mean_reward:.2f}')
test.assertGreater(best_cpg_mean_reward, 10)
Best CPG mean reward: 143.87
Now let's take a look at a gameplay video of our cpg model after the short training!
hp = hw4.answers.part1_pg_hyperparams()
p_net_cpg = hw4pg.PolicyNet.build_for_env(env, **hp)
p_net_cpg.load_state_dict(results['cpg']['model_state'])
env, n_steps, reward = hw4pg.PolicyAgent.monitor_episode(ENV_NAME, p_net_cpg)
print(f'{n_steps} steps, total reward: {reward:.2f}')
show_monitor_video(env)
300 steps, total reward: 119.36
We have seen that the policy-gradient loss can be interpreted as a log-likelihood of the policy term (selecting a specific action at a specific state), weighted by the future rewards of that choice of action.
However, naïvely weighting by rewards has significant drawbacks in terms of the variance of the resulting gradient. We addressed this by adding a simple baseline term which represented our "expected reward" so that we increase probability of actions leading to trajectories which exceed this expectation and vice-versa.
In this part we'll explore a more powerful baseline, which is the idea behind the Advantage Actor-Critic (AAC) method.
Recall the definition of the state-value function $v_{\pi}(s)$ and action-value function $q_{\pi}(s,a)$:
$$ \begin{align} v_{\pi}(s) &= \E{g(\tau)|s_0 = s,\pi} \\ q_{\pi}(s,a) &= \E{g(\tau)|s_0 = s,a_0=a,\pi}. \end{align} $$

Both these functions represent the value of the state $s$. However, $v_\pi$ averages over the first action according to the policy, while $q_\pi$ fixes the first action and then continues according to the policy.
Their difference is known as the advantage function: $$ a_\pi(s,a) = q_\pi(s,a)-v_\pi(s). $$
If $a_\pi(s,a)>0$ it means that it's better (in expectation) to take action $a$ in state $s$ compared to the average action. In other words, $a_\pi(s,a)$ represents the advantage of using action $a$ in state $s$ compared to the others.
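A quick toy computation (the numbers are ours) showing that $v_\pi$ is the policy-weighted average of $q_\pi(s,\cdot)$, and the advantages that result:

```python
import numpy as np

q = np.array([1.0, 2.0, 4.0])   # q_pi(s, a) for each of 3 actions
pi = np.array([0.5, 0.3, 0.2])  # pi(a|s)
v = float(pi @ q)               # v_pi(s): expected q under the policy
adv = q - v                     # advantage of each action
print(v, adv)                   # v ≈ 1.9, adv ≈ [-0.9, 0.1, 2.1]
```

Only the last action has a clearly positive advantage, so its log-probability would be pushed up by the advantage-weighted gradient.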
So far we have used an estimate for $q_\pi$ as our weighting term for the log-policy, with a fixed baseline per batch.
$$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Now, we will use the state-value as a baseline, so that an estimate of the advantage function is our weighting term:

$$ \hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-v_\pi(s_{i,t})\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Intuitively, using the advantage function makes sense because it means we're weighting our policy's actions according to how advantageous they are compared to other possible actions.
But how will we know $v_\pi(s)$? We'll learn it of course, using another neural network. This is known as actor-critic learning: we simultaneously learn the policy (actor) and the value of states (critic). We'll treat it as a regression task: given a state $s_{i,t}$, our state-value network will output $\hat{v}_\pi(s_{i,t})$, an estimate of the actual unknown state-value. Our regression targets will be the discounted rewards, $\hat{q}_{i,t}$ (see question 2), and we can use a simple MSE as the loss function, $$ \mathcal{L}_{\text{SV}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0}\left(\hat{v}_\pi(s_{i,t}) - \hat{q}_{i,t}\right)^2. $$
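Combining the advantage-weighted policy term with the critic's regression loss gives the overall objective. A standalone sketch (names and the exact weighting are ours; the homework's AACPolicyGradientLoss takes a delta argument that appears to play a similar role):

```python
import torch
import torch.nn.functional as F

def aac_loss(action_scores, state_values, actions, qvals, delta=1.0):
    # state_values: (N, 1) critic outputs v-hat(s); qvals: (N,) targets q-hat.
    v = state_values.squeeze(1)
    advantage = qvals - v.detach()   # detach: policy term must not train critic
    log_pi = F.log_softmax(action_scores, dim=1)
    log_pi_taken = log_pi.gather(1, actions.view(-1, 1)).squeeze(1)
    loss_p = -(advantage * log_pi_taken).mean()   # advantage-weighted policy loss
    loss_v = F.mse_loss(v, qvals)                 # critic regression loss
    return loss_p + delta * loss_v

scores = torch.zeros(2, 2)           # uniform policy scores
values = torch.ones(2, 1)            # critic predicts v-hat = 1 everywhere
actions = torch.tensor([0, 1])
qvals = torch.tensor([1.0, 3.0])
print(aac_loss(scores, values, actions, qvals))  # 2 + ln(2) ≈ 2.6931
```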
We'll build heavily on our implementation of the regular policy-gradient method, and just add a new model class and a new loss class, with a small modification to the agent.
Let's start with the model. It will accept a state, and return action scores (as before), but also the value of that state. You can experiment with a dual-head network that has a shared base, or implement two separate parts within the network.
TODO:
- Implement the AACPolicyNet class in the hw4/rl_ac.py module.
- Tweak the part1_aac_hyperparams() function of the hw4.answers module.

import hw4.rl_ac as hw4ac
hp = hw4.answers.part1_aac_hyperparams()
pv_net = hw4ac.AACPolicyNet.build_for_env(env, device, **hp)
pv_net
AACPolicyNet(
(actions_model): Sequential(
(0): Linear(in_features=8, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=256, bias=True)
(3): ReLU()
(4): Linear(in_features=256, out_features=4, bias=True)
)
(values_model): Sequential(
(0): Linear(in_features=8, out_features=256, bias=True)
(1): ReLU()
(2): Linear(in_features=256, out_features=128, bias=True)
(3): ReLU()
(4): Linear(in_features=128, out_features=1, bias=True)
)
)
TODO: Complete the implementation of the agent class, AACPolicyAgent, in the hw4/rl_ac.py module.
agent = hw4ac.AACPolicyAgent(env, pv_net, device)
exp = agent.step()
test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-0.0033, 1.4164, -0.3389, 0.2454, 0.0039, 0.0768, 0.0000, 0.0000]), action=3, reward=1.603753093962781, is_done=False)
TODO: Implement the AAC loss function as the class AACPolicyGradientLoss in the hw4/rl_ac.py module.
loss_fn_aac = hw4ac.AACPolicyGradientLoss(delta=1.)
test_state_values = torch.ones(test_action_scores.shape[0], 1)
loss_t, loss_dict = loss_fn_aac(test_batch, (test_action_scores, test_state_values))
print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['adv_m'], -23.191, delta=1e-2)
test.assertAlmostEqual(loss_t.item(), 1183.948, delta=1e-2)
loss_dict={'loss_p': -38.280941009521484, 'loss_v': 1222.228759765625, 'adv_m': -23.19112777709961}
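Schematically, the AAC loss combines a policy term weighted by the advantage (with the critic detached so the policy term doesn't train the critic) and the state-value regression term. A hedged sketch, where the function name and the `delta` weighting are assumptions based on the cell above:

```python
import torch
import torch.nn.functional as F

def aac_loss_sketch(log_probs, q_est, state_values, delta=1.0):
    # log_probs:     log pi(a_t|s_t) of the taken actions, shape (T,)
    # q_est:         estimated discounted returns q_hat, shape (T,)
    # state_values:  critic outputs v_hat(s_t), shape (T,)
    advantage = q_est - state_values.detach()   # detach: advantage is a weight, not a critic target
    loss_p = -(advantage * log_probs).mean()    # policy-gradient term
    loss_v = F.mse_loss(state_values, q_est)    # critic regression term (L_SV above)
    loss = loss_p + delta * loss_v
    return loss, dict(loss_p=loss_p.item(), loss_v=loss_v.item(),
                      adv_m=advantage.mean().item())

loss, info = aac_loss_sketch(torch.zeros(4), torch.ones(4),
                             torch.zeros(4, requires_grad=True))
print(info)
```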
Let's run the same experiment as before, but with the AAC method and compare the results.
def train_aac(baseline=False, entropy=False, **train_kwargs):
hp = hw4.answers.part1_aac_hyperparams()
loss_fns = [hw4ac.AACPolicyGradientLoss(hp['delta']), hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta'])]
return train_rl(hw4ac.AACPolicyAgent, hw4ac.AACPolicyNet, loss_fns, hp, **train_kwargs)
training_data_filename = os.path.join('results', f'part1_exp_aac.dat')
# Set to True to force re-run (careful, will delete old experiment results)
force_run = False
if os.path.isfile(training_data_filename) and not force_run:
print(f'=== results file {training_data_filename} exists, skipping experiments.')
results_aac = load_training_data(training_data_filename)
else:
print(f'=== Running AAC experiment')
training_data = train_aac(max_episodes=exp_max_episodes)
results_aac = dict(aac=training_data)
dump_training_data(results_aac, training_data_filename)
=== results file results/part1_exp_aac.dat exists, skipping experiments.
experiments_results_fig = plot_experiment_results(results)
plot_experiment_results(results_aac, fig=experiments_results_fig);
You should get better results with the AAC method, so this time the bar is higher (again, you should aim for a mean reward of 100+). Compare the graphs with those of the combined PG method and check that they make sense.
best_aac_mean_reward = max(results_aac['aac']['mean_reward'])
print(f'Best AAC mean reward: {best_aac_mean_reward:.2f}')
test.assertGreater(best_aac_mean_reward, 50)
Best AAC mean reward: 135.22
Now, using your best model and hyperparameters, let's train the model for much longer and see how it performs. Just for fun, we'll also visualize an episode every now and then so that we can see how well the agent is playing.
TODO:
- Once you have a trained checkpoint you're happy with, copy the checkpoint file and append _final to the file name.
  This will cause the block to skip training and instead load your saved model when running the homework submission script.
- Note that your submission zip file will not include the checkpoint file. This is OK.

import IPython.display
CHECKPOINTS_FILE = f'checkpoints/{ENV_NAME}-ac.dat'
CHECKPOINTS_FILE_FINAL = f'checkpoints/{ENV_NAME}-ac_final.dat'
TARGET_REWARD = 125
MAX_EPISODES = 15_000
def post_batch_fn(batch_idx, p_net, batch, print_every=20, final=False):
if not final and batch_idx % print_every != 0:
return
env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
html = show_monitor_video(env, width="500")
IPython.display.clear_output(wait=True)
print(f'Monitor@#{batch_idx}: n_steps={n_steps}, total_reward={reward:.3f}, final={final}')
IPython.display.display_html(html)
if os.path.isfile(CHECKPOINTS_FILE_FINAL):
print(f'=== {CHECKPOINTS_FILE_FINAL} exists, skipping training...')
checkpoint_data = torch.load(CHECKPOINTS_FILE_FINAL)
hp = hw4.answers.part1_aac_hyperparams()
pv_net = hw4ac.AACPolicyNet.build_for_env(env, **hp)
pv_net.load_state_dict(checkpoint_data['params'])
print(f'=== Running best model...')
env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, pv_net)
print(f'=== Best model ran for {n_steps} steps. Total reward: {reward:.2f}')
IPython.display.display_html(show_monitor_video(env))
best_mean_reward = checkpoint_data["best_mean_reward"]
else:
print(f'=== Starting training...')
train_data = train_aac(TARGET_REWARD, max_episodes=MAX_EPISODES,
seed=None, checkpoints_file=CHECKPOINTS_FILE, post_batch_fn=post_batch_fn)
print(f'=== Done, ', end='')
best_mean_reward = train_data["best_mean_reward"][-1]
print(f'num_episodes={train_data["episode_num"][-1]}, best_mean_reward={best_mean_reward:.1f}')
test.assertGreaterEqual(best_mean_reward, TARGET_REWARD)
=== checkpoints/Beresheet-v2-ac_final.dat exists, skipping training...
=== Running best model...
=== Best model ran for 300 steps. Total reward: 184.67
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.
from cs236781.answers import display_answer
import hw4.answers
Explain qualitatively why subtracting a baseline in the policy-gradient helps reduce its variance. Specifically, give an example where it helps.
display_answer(hw4.answers.part1_q1)
Your answer:
Subtracting a baseline $b$ in the policy gradient helps reduce variance because it centers the returns: returns smaller than expected (where $b$ acts as the expected value) become negative, while returns larger than the expected value remain positive. In other words, it creates a sign difference between negligible positive returns and the higher returns we actually want, so the gradient pushes probability toward trajectories whose return exceeds $b$ instead of reinforcing every positive return. Since the expectation of the gradient is unchanged by subtracting a constant, while the magnitudes of the return weights shrink, the variance of the gradient estimator decreases.

For example, if some trajectories lead to a reward of $\sim 50$ and others to $\sim 100$, we want the model to prefer the $\sim 100$ trajectories. Choosing $b=90$ makes the adjusted return of the $\sim 50$ trajectories negative while the $\sim 100$ trajectories remain positive, which helps the model select the better trajectories.
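A tiny Monte-Carlo illustration of the variance reduction (the returns around 50 and 100 and the ±1 stand-in for the score-function term $\nabla\log\pi$ are made up for this example):

```python
import torch

torch.manual_seed(0)
# returns from two kinds of trajectories: ~50 and ~100
r = torch.cat([torch.normal(50., 5., size=(5000,)),
               torch.normal(100., 5., size=(5000,))])
g = torch.sign(torch.rand(10000) - 0.5)  # stand-in for grad log pi terms
b = r.mean()                             # a simple baseline: the mean return

var_no_baseline = (r * g).var()
var_baseline = ((r - b) * g).var()
print(var_no_baseline.item(), var_baseline.item())  # the baseline version is far smaller
```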
In AAC, when using the estimated q-values as regression targets for our state-values, why do we get a valid approximation? Hint: how is $v_\pi(s)$ expressed in terms of $q_\pi(s,a)$?
display_answer(hw4.answers.part1_q2)
Your answer:
Recall: $$ \pi(a|s)=\mathrm{Pr}(a_t=a|s_t=s) $$ $$ v_{\pi}(s) = \mathbb{E}[g_t(\tau))|s_t=s, \pi] $$ $$ q_{\pi}(s,a) = \mathbb{E}[g_t(\tau)|s_t=s, a_t = a, \pi] $$
Namely, $v_{\pi}(s)$ is the expected return from state $s$ under the policy, while $q_{\pi}(s,a)$ additionally conditions on the first action $a$.
Therefore the relationship between $v_{\pi}(s)$ and $q_{\pi}(s,a)$ can be presented as:
$$
\begin{align}
v_{\pi}(s) &= \mathbb{E}[g_t(\tau)|s_t=s, \pi] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s_t=s) \mathbb{E}[g_t(\tau)|s_t=s, a_t = a, \pi]
\\ &= \sum_{a \in \mathcal{A}(s)} \pi(a|s_t=s) \cdot q_{\pi}(s,a)
\end{align}
$$
In other words, $v$ is the average of the $q$-values weighted by $\pi(a|s)=\mathrm{Pr}(a_t=a|s_t=s)$. We can therefore use the action-values $q$ to approximate the state-value $v$, since $v_\pi(s)$ is exactly the expectation of $q_\pi(s,a)$ over the actions sampled from the policy.
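A toy numeric check of this identity (the probabilities and q-values here are made up):

```python
import torch

pi = torch.tensor([0.1, 0.6, 0.3])   # pi(a|s) over three actions
q = torch.tensor([2.0, 5.0, -1.0])   # q_pi(s, a) for each action
v = (pi * q).sum()                   # v_pi(s) = E_{a~pi}[q_pi(s, a)]
print(v.item())  # 0.2 + 3.0 - 0.3 = 2.9
```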
cpg).

display_answer(hw4.answers.part1_q3)
Your answer:
Recall:
vpg): No baseline, no entropy loss ($\hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta})$)
bpg): Baseline, no entropy loss ($\hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta})$)
epg): No baseline, with entropy loss ($\hat\grad\mathcal{L}_{\mathcal{H}(\pi)}(\vec{\theta})$)
cpg): Baseline, with entropy loss ($\hat\grad\mathcal{L}_{\text{CPG}}(\vec{\theta})$)

First experiment results analysis of vpg, epg, bpg and cpg:
1.1. loss_p: (the policy-gradient loss, i.e. the negative average over trajectories)
The methods without a baseline (vpg and epg) start with a relatively large negative loss_p, improve the policy during training, and eventually exceed zero.
The methods that use baseline subtraction (bpg and cpg) show only small changes around $0$. This does not mean the policy isn't improving, as the other graphs show; it is simply a consequence of the baseline subtraction.
1.2. loss_e: (negative entropy loss)
This graph is relevant only for the methods that use an entropy loss: epg, which uses $\hat\grad\mathcal{L}_{\mathcal{H}(\pi)}(\vec{\theta})$, and cpg, which uses $\hat\grad\mathcal{L}_{\text{CPG}}(\vec{\theta})$.
The (negative) values of both methods rise (get closer to zero) as training progresses.
A high entropy (here, larger in absolute value) means the action probability distribution is close to uniform; a decrease in entropy suggests the network is converging to a good policy and acting more confidently.
The baseline subtraction in cpg speeds up the convergence rate. A good way to see this is the similar "undershoot" in both graphs, which happens earlier in cpg (around episode 500) compared to epg (around episode 1,000). Hence the cpg method is more efficient and reaches better values.
1.3. baseline: (the baseline $b$) The value of $b$ increases with every batch; it grows quickly at first and then converges slowly. A higher $b$ helps the network increase the rewards. Here too, we can see that the method with a baseline converges faster.
1.4. mean_reward:
This graph shows an increase in mean reward for all methods.
Thus, we can conclude that all the loss functions fit this problem (after choosing good hyperparameters).
Comparing the methods with and without entropy (vpg vs. epg and bpg vs. cpg), we can see that entropy helps the process without a baseline, but with a baseline the graphs look quite similar.
Additionally, as mentioned before, using a baseline helps training converge faster (comparing vpg vs. bpg and epg vs. cpg): the mean reward is much higher in the earlier episodes, but at later episodes the curves reach fairly similar values.
Comparison between regular policy-gradient cpg and advantage actor-critic AAC:
2.1. loss_p:
AAC starts with lower values (similar to the methods without a baseline) but quickly improves and reaches higher values than cpg, which means a lower trajectory loss. We can conclude that AAC is the better model and approximates the state value better (a better baseline).
2.2. loss_e:
The AAC curve here fluctuates more, which probably depends on the hyperparameters. However, the AAC entropy loss reaches a smaller absolute value and hence gives better results compared to cpg. The difference depends strongly on the choice of baseline, since this is the only difference between $\hat\grad\mathcal{L}_{\text{CPG}}(\vec{\theta})$ and $\hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta})$.
2.3. baseline: not relevant for AAC, same as 1.3.
2.4. mean_reward:
AAC and cpg show similar performance in the earlier episodes, but later AAC shows a decrease (which seems to coincide with the instability in its entropy loss), although it appears to recover by the end. We can say that this task is probably simple enough that cpg achieves rewards no worse than AAC.
This section contains summary questions about various topics from the course material.
You can add your answers in new cells below the questions.
Notes
Answer:
The receptive field of a biological neuron is the portion of the sensory space that elicits neuronal responses when stimulated. In the deep learning context, the receptive field is defined as the size of the region in the input that produces a given feature. As shown in tutorial 4 using the figure below, in CNNs each pixel in each layer may receive information from a region of the previous layer, depending on the architecture.
Answer:
We can control the receptive field size using the padding, stride, and dilation parameters, as illustrated below:

| (A) No padding, Stride = Dilation = 1 | (B) Padding = Stride = Dilation = 1 | (C) Stride = 2, Padding = Dilation = 1 | (D) No padding, Stride = 1, Dilation = 2 |
|---|---|---|---|
import torch
import torch.nn as nn
cnn = nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
nn.ReLU(),
)
cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
torch.Size([1, 32, 122, 122])
What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?
Answer:
The receptive field for a 1D input can be calculated using the following recursive formula: $RF_{l-1} = s_l \cdot (RF_l - 1) + d_l \cdot (k_l - 1) + 1$, where $RF_l$ is the receptive field at layer $l$, $RF_L=1$ where $L$ is the last layer, $k_l$ is the kernel size, $s_l$ is the stride, and $d_l$ is the dilation factor. Note that layers like ReLU don't affect the $RF$.
We assume that padding does not affect the $RF$; it is only used to correct edge effects, as is the case here. The explanation for this recursive formula is:
Summing these two terms gives the recursive formula written at the beginning of the answer. Since the input here is an image, the answer is the same for both axes ($RF \times RF$). Using the formula, we can recursively calculate the receptive field at the input of the architecture above, denoted here $RF_0$:
$ \text{ReLU layers:} \ RF_1 = RF_2 ; RF_7 = RF_8 ; RF_4 = RF_5 \\ \text{Final layer:} \ RF_8 = 1 \\ $
$ RF_0 = 1 \cdot (RF_1 - 1) + 1 \cdot (3 - 1) + 1 = RF_1 + 2 = RF_2 + 2 =\\ = ( 2 \cdot (RF_3 - 1) + 1 \cdot (2 - 1) + 1 ) + 2 = \dots = 2 \cdot RF_3 + 2 =\\ = 2 \cdot ( 2 \cdot (RF_4 - 1) + 1 \cdot (5 - 1) + 1 ) + 2 = 4 \cdot RF_4 + 8 = 4 \cdot RF_5 + 8 = \\ = 4 \cdot ( 2 \cdot (RF_6 - 1) + 1 \cdot (2 - 1) + 1 ) + 8 = 8 \cdot RF_6 + 8 =\\ = 8 \cdot ( 1 \cdot (RF_7 - 1) + 2 \cdot (7 - 1) + 1 ) + 8 = 8 \cdot RF_7 + 104 = 8 \cdot RF_8 + 104 \\ \\ \Longrightarrow RF_0 = 8 + 104 = 112 $
In conclusion, each axis has an $RF$ of 112; since we operate on an image with the same operations along both axes, we get $RF_0 = 112 \times 112$.
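The recursion above can also be computed in code; the layer list below matches the CNN defined earlier, with one (kernel, stride, dilation) tuple per RF-affecting layer (ReLUs omitted since they don't change the RF):

```python
def receptive_field(layers):
    """layers: (kernel, stride, dilation) tuples, ordered input -> output."""
    rf = 1  # receptive field of a single output pixel w.r.t. the last layer
    for k, s, d in reversed(layers):
        rf = s * (rf - 1) + d * (k - 1) + 1
    return rf

layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print(receptive_field(layers))  # 112
```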
You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).
After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.
However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.
Answer:
Basically, changing the network architecture changes the weight values that optimize the loss function. For example, if we plug the weights learned by the original network into the residual network, then since each layer now computes a different function (because we add $x$ to its output), we will obviously get the wrong output. Therefore, the weights that optimize the residual network have to adjust to the fact that the blocks are now different (residual).
import torch.nn as nn
p1, p2 = 0.1, 0.2
nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
nn.ReLU(),
nn.Dropout(p=p1),
nn.Dropout(p=p2),
)
Sequential( (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): ReLU() (2): Dropout(p=0.1, inplace=False) (3): Dropout(p=0.2, inplace=False) )
If we want to replace the two consecutive dropout layers with a single one defined as follows:
nn.Dropout(p=q)
what would the value of q need to be? Write an expression for q in terms of p1 and p2.
Answer:
The nn.Dropout(p) layer randomly zeroes elements of the input tensor with probability $p$, using samples from a Bernoulli distribution. Therefore, after the sequence nn.Dropout(p=p1) $\rightarrow$ nn.Dropout(p=p2), an element is zeroed if it was zeroed by the first dropout layer ($p_1$) **OR** by the second dropout layer ($p_2$), and these two events are statistically independent.
Therefore, an equivalent single layer nn.Dropout(p=q) needs the probability $q = \Pr(\text{drop}_1 \cup \text{drop}_2) = p_1 + p_2 - p_1 \cdot p_2 = 0.1 + 0.2 - 0.1 \cdot 0.2 = 0.28$.
A simpler way is to compute it via the complementary event: the probability of a neuron surviving a dropout layer is $1-p$, so we need the complement of surviving both, $(1-p_1)$ **AND** $(1-p_2)$, and hence the equivalent is $q = 1-\Pr(\overline{\text{drop}}_1 \cap \overline{\text{drop}}_2) = 1-(1-p_1) \cdot (1-p_2) = 1-0.9 \cdot 0.8 = 0.28$.
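We can sanity-check this empirically by counting zeros after two stacked dropout layers (a quick Monte-Carlo check; note that freshly constructed modules are in training mode, so dropout is active):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p1, p2 = 0.1, 0.2
x = torch.ones(1_000_000)
# apply two dropout layers in sequence (both in training mode by default)
y = nn.Dropout(p2)(nn.Dropout(p1)(x))
frac_zero = (y == 0).float().mean().item()
q = 1 - (1 - p1) * (1 - p2)
print(frac_zero, q)  # both ~0.28
```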
Answer:
False. Usually, dropout is applied after the activation function, but it is not a must. For example, with the ReLU activation it can make sense to apply the dropout before the activation, since $ReLU(0) = 0$, making it slightly more computationally efficient. Either way, it is not required.
Answer: The dropout activation can be written as: $$ y_{dropout} = f(x)= \begin{cases} x, & \text{with probability of} \ 1-p \\ 0, & \text{with probability of} \ p \end{cases} $$
Let's calculate the expected value: $\mathbb{E}[y_{dropout}] = p\cdot 0 +(1-p) \cdot x = (1-p) \cdot x$. It is easy to see that without dropout we have $\mathbb{E}[y_{without\ dropout}] = x$, so we get a scale factor of $\frac{\mathbb{E}[y_{without\ dropout}]}{\mathbb{E}[y_{dropout}]} = \frac{1}{1-p}$.
Answer: We would not choose the L2 loss in this case, because it does not strongly penalize wrong classifications. We would choose binary cross-entropy instead.
We will explain with a numerical example: assume our classifier is completely wrong when classifying a hotdog (label 1), i.e. it predicts 0 (dog). The L2 loss is $(1-0)^2 = 1$, whereas the binary cross-entropy is $- (0 \cdot \log(1) + 1 \cdot \log(0)) \longrightarrow \infty$.
In addition, using MSE implicitly assumes the underlying data was generated from a normal distribution (in Bayesian terms, a Gaussian assumption), while a dataset with two categories (i.e. binary labels) follows a Bernoulli distribution, not a normal one. Moreover, the MSE loss is non-convex for binary classification with a sigmoid output: a binary classifier trained with MSE is not guaranteed to minimize the loss well, because MSE expects real-valued inputs in $(-\infty, \infty)$, while binary classification models output probabilities in $[0,1]$ through the sigmoid/logistic function.
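The numerical example above can be reproduced with PyTorch's built-in losses (the prediction is clamped to 0.01 rather than exactly 0, to avoid $\log(0)$):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])       # true label: hotdog
y_hat = torch.tensor([0.01])  # confidently wrong prediction (clamped away from 0)

mse = F.mse_loss(y_hat, y)
bce = F.binary_cross_entropy(y_hat, y)
print(mse.item(), bce.item())  # MSE ~0.98 vs BCE ~4.6: BCE penalizes far more
```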

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe.
You define your model as follows:
import torch.nn as nn
N = 42 # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
nn.Linear(in_features=N, out_features=H),
nn.Sigmoid(),
*[
nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
]*24,
nn.Linear(in_features=H, out_features=1),
)
# print(mlpirate)
While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?
Answer:
We assume the most likely cause is the "vanishing gradients" problem, which occurs when the gradients computed during backpropagation become very small, so the neurons' weights barely change. This fits here because the model is very deep and has many sigmoid layers (25 × (FC → Sigmoid) → FC). As the figure below shows, the sigmoid's derivative lies in $[0, \sim 0.25]$, so multiplying roughly 25 such factors causes the gradients to vanish.
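We can observe this directly by backpropagating once through the same architecture and comparing the gradient magnitudes at the first and last layers (random input, untrained weights; the batch size and the sum-loss are arbitrary choices for this check):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H = 42, 128
# same architecture as mlpirate above (list multiplication repeats the
# same module objects, exactly as in the original definition)
mlpirate = nn.Sequential(
    nn.Linear(N, H), nn.Sigmoid(),
    *[nn.Linear(H, H), nn.Sigmoid()] * 24,
    nn.Linear(H, 1),
)

out = mlpirate(torch.randn(8, N))
out.sum().backward()
first_grad = mlpirate[0].weight.grad.abs().mean().item()
last_grad = mlpirate[-1].weight.grad.abs().mean().item()
# the first layer's gradient is many orders of magnitude smaller
print(f'{first_grad=:.3e} {last_grad=:.3e}')
```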
Your friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.

Answer: No, he is wrong. Although the derivative of $\tanh$ lies in $[0,1]$, which is better than sigmoid's, it is still less than $1$ almost everywhere, and the network architecture is still very deep, so the vanishing gradient problem will not be solved.
Answer:
A. True. The derivative of ReLU is
$
\frac{d}{dx}ReLU(x) = \frac{d}{dx}\max(0,x) =
\begin{cases}
1, & x>0 \\
0, & x \leq 0 \ \text{(at $x=0$ the derivative is undefined but set to 0)}
\end{cases}
$
Vanishing gradients can occur only when the derivative of the activation function lies in $(0, 1)$, because then the product of many such factors behaves like a geometric series converging to 0. ReLU's derivative is either exactly $1$ or exactly $0$, so repeated multiplication does not gradually shrink the gradient.
B. True. As mentioned in the previous section, for positive inputs the gradient is constantly $1$, so ReLU is linear for positive inputs $x$.
C. True. It is possible, since for negative inputs ReLU outputs $0$, which can cause the activation to remain stuck at a value of $0$ (a "dead" neuron).
Answer:
All GD methods are iterative: they update the parameters in a direction that reduces the function value until a minimum is reached.
Answer:
A. SGD is more often used in practice because:
- To use GD we need the whole training set for each weight update; if the training set is large, reaching the minimum can take a very long time.
- Sometimes the training set is too large to even fit into memory all at once, in which case we cannot use GD at all.
B. When the training data is too big to fit in memory all at once, we cannot use GD.
Answer:
We expect the number of iterations it takes to converge to $l_0$ to decrease, i.e. we will converge faster. As a rule of thumb, a larger batch size gives a less noisy gradient estimate and a more robust optimization process. As can be seen from the image in question #1 of this section, increasing the batch size decreases the number of iterations needed to converge. Since the ideal process is GD, which uses the whole dataset and converges in the fewest iterations, increasing the batch size brings us closer to that ideal process, and therefore we converge faster.
Answer:
A. True - SGD uses one randomly chosen sample at a time and avoids repeating the same sample, so it iterates over the whole dataset in each epoch. This is often implemented by shuffling the whole dataset at the start of each epoch and then using the samples one after another in that order.
B. False - Exactly the opposite. Since it uses only one example from the data at a time, the update process is very noisy, with high variance. In expectation SGD follows the GD direction, but when actually performing SGD we might take steps "the wrong way"; on average we still progress towards the minimum. So the variance is greater in SGD.
C. True - If we are currently at a local minimum, GD is stuck, because GD computes the exact gradient of the function, which is 0 there. SGD, however, might escape the local minimum, since it uses only some of the samples and not the whole dataset, so its gradient is not exactly 0; if we keep iterating we can escape the local minimum. In other words, the noise described in B may help SGD get out of local minima.
D. False - GD needs the entire dataset for each iteration, whereas SGD uses just a minibatch (or even a single sample) per iteration.
E. False - Both methods are guaranteed to converge only to a local minimum. If the function is not convex, the value we converge to depends on the parameter initialization, and it may be a local minimum.
F. True - Momentum helps here because it accumulates the gradients of previous iterations (from the beginning until the current step, with a decay factor); if the current gradient has the same direction as the momentum, the update in that direction takes a larger step, and thus in our case it converges faster. See the visualization below. Newton's method, on the other hand, requires computing the Hessian, an $N \times N$ matrix, which is expensive, and with high curvature it won't converge quickly.
Answer:
False - As shown in the tutorial, we want to find $\tilde{y}$ such that: $$\tilde{y}=argmin_y f(y; z)$$
We also know the following:
If the gradient of $f$ is $0$ for a given $y$, then perturbing $z \rightarrow z+dz$ will "move" the minimum to a different $y+dy$.
Therefore, we can write using Taylor: $$\nabla_y f(y+dy,z+dz) \approx \nabla_y f(y,z)+\nabla^2_{yy} f(y,z)dy+\nabla^2 _{yz} f(y,z)dz=0 $$
Because $$\nabla_y f(y,z)=0$$ we obtain: $$\nabla^2_{yy} f(y,z)dy=-\nabla^2 _{yz} f(y,z)dz$$ and then we can write: $$dy =-[\nabla^2_{yy}f(y,z)]^{-1}\nabla^2 _{yz}f(y,z)dz $$ This gives us the expression we need in order to compute the gradient without a descent-based method.
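A tiny autograd check of this formula on a scalar example (the function $f(y,z) = (y - z^2)^2$ is made up for this check; its argmin over $y$ is $y^*(z) = z^2$, so the true sensitivity is $\frac{dy}{dz} = 2z$):

```python
import torch

def f(y, z):
    return (y - z ** 2) ** 2  # argmin over y is y* = z^2

z = torch.tensor(1.5, requires_grad=True)
y = torch.tensor(1.5 ** 2, requires_grad=True)  # at the minimizer for this z

g_y = torch.autograd.grad(f(y, z), y, create_graph=True)[0]  # grad_y f = 0 here
f_yy = torch.autograd.grad(g_y, y, retain_graph=True)[0]     # = 2
f_yz = torch.autograd.grad(g_y, z)[0]                        # = -4z
dy_dz = -f_yz / f_yy                                         # the formula above
print(dy_dz.item())  # 3.0 == 2 * z
```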
Answer:
A. Vanishing gradients: the gradients become very close to 0. Exploding gradients: the gradients grow larger and larger, approaching infinity. Either way, learning becomes very hard.
B. In the backpropagation algorithm we multiply gradients together, and the number of multiplied factors is proportional to the number of layers. If the factors are in $(0,1)$ and the number of layers is large, the product vanishes (a geometric series with ratio $q<1$): vanishing gradients. If the factors are greater than one and the number of layers is large, the product explodes (a geometric series with $q>1$): exploding gradients.
C. Consider a network with $K$ hidden layers. The gradient w.r.t. the first layer is $\pderiv{L}{W_1} = \pderiv{L}{h^{(K)}} \cdot \pderiv{h^{(K)}}{h^{(K-1)}} \cdot \pderiv{h^{(K-1)}}{h^{(K-2)}} \cdots \pderiv{h^{(1)}}{W_1}$, a product of $K+1$ factors. If each factor is smaller than $0.1$, then $\left\lvert\pderiv{L}{W_1}\right\rvert < 0.1^{K+1}$, which tends to 0 for large $K$: vanishing gradients. If each factor is greater than $1.1$, the product grows like $1.1^{K+1}$, which tends to infinity for large $K$: exploding gradients.
D. We can identify which case we are in by looking at the loss function: if the loss is stationary we have vanishing gradients; if the loss changes rapidly (in each iteration) we have exploding gradients.
You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}+ \vec{b}_1) + \vec{b}_2 $$ Your wish to minimize the in-sample loss function is defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y,\hat{y}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ Where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$
Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.
Answer:
w.r.t. $W_1$: $$\frac{\partial L_s}{\partial W_1} =\frac{1}{N} \sum_{i=1}^N \frac{\partial l_i}{\partial W_1} + \lambda W_1$$ Where the following stands: $$\frac{\partial l_i}{\partial W_1} = -y \frac{\partial log(\hat{y})}{\partial W_1} - (1-y) \frac{\partial log(1-\hat{y})}{\partial W_1}$$ $$\frac{\partial log(\hat{y})}{\partial W_1} = \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial W_1}$$ $$\frac{\partial log(1-\hat{y})}{\partial W_1} = -\frac{1}{1-\hat{y}} \frac{\partial \hat{y}}{\partial W_1}$$ $$\frac{\partial \hat{y}}{\partial W_1} = (xW_2\cdot \text{diag}[\varphi'(W_1x+b_1)])^T $$ Finally: $$ \frac{\partial L_s}{\partial W_1} =\frac{1}{N} \sum_{i=1}^N \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} (xW_2\cdot \text{diag}[\varphi'(W_1x+b_1)])^T+\lambda W_1$$
w.r.t. $W_2$: $$\frac{\partial L_s}{\partial W_2} =\frac{1}{N} \sum_{i=1}^N \frac{\partial l_i}{\partial W_2} + \lambda W_2$$ Where the following stands: $$\frac{\partial l_i}{\partial W_2} = -y \frac{\partial log(\hat{y})}{\partial W_2} - (1-y) \frac{\partial log(1-\hat{y})}{\partial W_2}$$ $$\frac{\partial log(\hat{y})}{\partial W_2} = \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial W_2}$$ $$\frac{\partial log(1-\hat{y})}{\partial W_2} = -\frac{1}{1-\hat{y}} \frac{\partial \hat{y}}{\partial W_2}$$ $$\frac{\partial \hat{y}}{\partial W_2} = \varphi^T(W_1x+b_1) $$ Finally: $$ \frac{\partial L_s}{\partial W_2} =\frac{1}{N} \sum_{i=1}^N \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \varphi^T(W_1x+b_1)+\lambda W_2$$
w.r.t. $b_1$: $$\frac{\partial L_s}{\partial b_1} =\frac{1}{N} \sum_{i=1}^N \frac{\partial l_i}{\partial b_1}$$ Where the following stands: $$\frac{\partial l_i}{\partial b_1} = -y \frac{\partial log(\hat{y})}{\partial b_1} - (1-y) \frac{\partial log(1-\hat{y})}{\partial b_1}$$ $$\frac{\partial log(\hat{y})}{\partial b_1} = \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial b_1}$$ $$\frac{\partial log(1-\hat{y})}{\partial b_1} = -\frac{1}{1-\hat{y}} \frac{\partial \hat{y}}{\partial b_1}$$ $$\frac{\partial \hat{y}}{\partial b_1} = \text{diag}[\varphi'(W_1x+b_1)]W_2^T $$ Finally: $$ \frac{\partial L_s}{\partial b_1} =\frac{1}{N} \sum_{i=1}^N \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \text{diag}[\varphi'(W_1x+b_1)]W_2^T$$
w.r.t. $b_2$: $$\frac{\partial L_s}{\partial b_2} =\frac{1}{N} \sum_{i=1}^N \frac{\partial l_i}{\partial b_2}$$ Where the following stands: $$\frac{\partial l_i}{\partial b_2} = -y \frac{\partial log(\hat{y})}{\partial b_2} - (1-y) \frac{\partial log(1-\hat{y})}{\partial b_2}$$ $$\frac{\partial log(\hat{y})}{\partial b_2} = \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial b_2}$$ $$\frac{\partial log(1-\hat{y})}{\partial b_2} = -\frac{1}{1-\hat{y}} \frac{\partial \hat{y}}{\partial b_2}$$ $$\frac{\partial \hat{y}}{\partial b_2} = \vec{1}$$ Finally: $$ \frac{\partial L_s}{\partial b_2} =\frac{1}{N} \sum_{i=1}^N \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} $$
w.r.t. $x$: $$\frac{\partial L_s}{\partial x} =\frac{1}{N} \sum_{i=1}^N \frac{\partial l_i}{\partial x}$$ Where the following stands: $$\frac{\partial l_i}{\partial x} = -y \cdot \frac{\partial log(\hat{y})}{\partial x} - (1-y) \frac{\partial log(1-\hat{y})}{\partial x}$$ $$\frac{\partial log(\hat{y})}{\partial x} = \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial x}$$ $$\frac{\partial log(1-\hat{y})}{\partial x} = -\frac{1}{1-\hat{y}} \frac{\partial \hat{y}}{\partial x}$$ $$\frac{\partial \hat{y}}{\partial x} = (W_2 \text{diag}[\varphi'(W_1x+b_1)]W_1)^T$$ Finally: $$ \frac{\partial L_s}{\partial x} =\frac{1}{N} \sum_{i=1}^N \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} (W_2 \text{diag}[\varphi'(W_1x+b_1)]W_1)^T$$
The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is $$ f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}} $$
Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
What are the drawbacks of this approach? List at least two drawbacks compared to AD.
Answer:
A. We can take a small step $\Delta\vec{W}$ and evaluate the function at $\vec{W}_0$ and at $\vec{W}_0+\Delta\vec{W}$ (where $\vec{W}$ stands for a parameter of the NN), then compute the quotient above. That gives the numerical derivative of the NN with respect to $\vec{W}$. Repeating this calculation for each parameter $\vec{W}$ in the NN yields the gradient vector numerically, without automatic differentiation.
B. Two drawbacks are:
TODO:
- Compute the gradient of the loss w.r.t. W and b using the approach of numerical gradients from the previous question.
- Verify using torch.allclose() that your numerical gradient is close to autograd's gradient.

import torch
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)
def foo(W, b):
return torch.mean(X @ W + b)
loss = foo(W, b)
print(f"{loss=}")
# TODO: Calculate gradients numerically for W and b
epsilon = 1e-6  # small finite-difference step (foo is linear in W and b, so any step gives the exact gradient)
W_1 = W.detach().clone()
b_1 = b.detach().clone()
if id(W) == id(W_1):
raise ValueError('shallow copy')
grad_W = torch.zeros(d, d, dtype=dtype, requires_grad=False)
grad_b = torch.zeros(d, dtype=dtype, requires_grad=False)
for i in range(d):
b_1[i] += epsilon
grad_b[i] = (foo(W_1,b_1) - foo(W,b))/epsilon
b_1[i] -= epsilon
for j in range(d):
W_1[i][j] += epsilon
grad_W[i][j] = (foo(W_1,b_1) - foo(W,b))/epsilon
W_1[i][j] -= epsilon
# TODO: Compare with autograd using torch.allclose()
loss.backward()
autograd_W = W.grad
autograd_b = b.grad
print(grad_W)
print(autograd_W)
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
loss=tensor(1.4387, dtype=torch.float64, grad_fn=<MeanBackward0>)
tensor([[0.0998, 0.0998, 0.0998, 0.0998, 0.0998],
[0.0947, 0.0947, 0.0947, 0.0947, 0.0947],
[0.1017, 0.1017, 0.1017, 0.1017, 0.1017],
[0.1014, 0.1014, 0.1014, 0.1014, 0.1014],
[0.1023, 0.1023, 0.1023, 0.1023, 0.1023]], dtype=torch.float64,
grad_fn=<CopySlices>)
tensor([[0.0998, 0.0998, 0.0998, 0.0998, 0.0998],
[0.0947, 0.0947, 0.0947, 0.0947, 0.0947],
[0.1017, 0.1017, 0.1017, 0.1017, 0.1017],
[0.1014, 0.1014, 0.1014, 0.1014, 0.1014],
[0.1023, 0.1023, 0.1023, 0.1023, 0.1023]], dtype=torch.float64)
Answer:
A. A word embedding is a representation of a word as a vector that encodes its meaning, in such a way that semantic similarities between two words can be identified from their representations in the vector space.
B. It can train, but the results would be poor: the model would not be able to capture the context of the words, and moreover it would not be able to identify similarities and dissimilarities between different words.
What does Y contain? Why this output shape? How would you implement nn.Embedding yourself using only torch tensors?
import torch.nn as nn
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
Y.shape=torch.Size([5, 6, 7, 8, 42000])
Answer:
A. num_embeddings is the number of words in the vocabulary, and embedding_dim is the dimension of the word vectors we are using.
So Y contains the word embeddings selected by the indices in $X$.
The shape of Y is the shape of X with the embedding dimension appended, because each index is replaced by its embedding vector. So $5\times6\times7\times8$ becomes $(5\times6\times7\times8)\times42000$.
B. As said above, num_embeddings is the vocabulary size and embedding_dim is the dimension of the word vectors.
So an embedding is simply a lookup table: a weight tensor of shape $42\times42{,}000$, where each index (a number between 0 and 41) selects a $42{,}000$-dimensional row.
The table entries are learned during training.
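A minimal sketch of this lookup using only plain tensors (with a small embedding_dim of 8 instead of 42,000, purely for illustration):

```python
import torch

torch.manual_seed(0)
num_embeddings, embedding_dim = 42, 8   # small dim for illustration
# the "embedding layer" is just a learnable lookup table
weight = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

X = torch.randint(low=0, high=num_embeddings, size=(5, 6, 7))
Y = weight[X]                           # advanced indexing performs the lookup
print(Y.shape)                          # indices' shape with embedding_dim appended

# sanity check against nn.Embedding initialized with the same table
emb = torch.nn.Embedding.from_pretrained(weight.detach())
assert torch.equal(emb(X), Y.detach())
```

The backward pass through advanced indexing accumulates gradients only into the rows that were actually looked up, which is exactly what nn.Embedding does.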
Answer:
A. True - The only modification to backpropagation is to accumulate gradients over windows of length S, since that is the only "short-term memory" we want.
B. False - We don't have to limit the sequence length; we can just limit the number of timesteps the backpropagation algorithm runs over to length S.
C. True - Since we limit the gradient accumulation to length S, we would only have "memory" of at most S timesteps.
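As an illustration of statement A, a truncated-BPTT loop can be sketched as follows (the RNN cell, sizes, and the per-window loss are made up for the example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNNCell(input_size=3, hidden_size=5)
opt = torch.optim.SGD(rnn.parameters(), lr=0.1)

x = torch.randn(20, 1, 3)   # one length-20 sequence (batch size 1)
S = 4                       # truncation window length
h = torch.zeros(1, 5)

for t in range(x.shape[0]):
    h = rnn(x[t], h)
    if (t + 1) % S == 0:
        loss = h.pow(2).mean()   # dummy per-window loss, for illustration only
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h.detach()           # cut the graph: gradients flow at most S steps back
```

The only change compared to full BPTT is the `detach()` at the end of each window, which prevents gradients from flowing further back than S timesteps.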
In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.
Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention
After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?
Answer:
A. Without attention the decoder receives the hidden state from the encoder as-is. By adding attention, the decoder can focus on different parts of the source sequence. The (soft) attention mechanism computes a weighted average of the encoder outputs, with weights that match the current decoder state.
B. In this case the hidden states of the decoder are not used as queries, so they won't be shaped by the attention mechanism. What is going to happen is that the hidden states of the encoder will change: they will come to represent the meaning of the sentence.
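The weighted-average computation from part A can be sketched as follows (random tensors stand in for real encoder/decoder states):

```python
import torch

torch.manual_seed(0)
T, h_dim = 6, 4                      # source length, hidden size
enc_outputs = torch.randn(T, h_dim)  # encoder hidden states
dec_state = torch.randn(h_dim)       # current decoder hidden state (the query)

scores = enc_outputs @ dec_state         # dot-product alignment scores, shape (T,)
weights = torch.softmax(scores, dim=0)   # attention distribution over source words
context = weights @ enc_outputs          # weighted average of encoder outputs
```

The context vector changes at every decoder step because the query (the decoder state) changes, which is what lets the decoder "focus" on different source positions.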
As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:
Answer:
A. In order to reconstruct the image, we use both the encoder and the decoder, therefore we need both the KL-divergence term (which is responsible for the encoding) and the reconstruction term (which is responsible for the decoding) to be optimized. In other words, the KL-divergence term affects the $x \rightarrow z$ transition, and hence it affects the reconstruction.
B. Here, the fact that we forgot to include the KL-divergence term is not going to hurt us, because the KL-divergence term is responsible for the encoder, not for the decoder which generates the samples. As explained before, it affects the $x \rightarrow z$ transition, which is not part of the generation process.
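For reference, the two loss terms in question can be sketched as follows (the encoder/decoder outputs here are random placeholders; the closed-form KL assumes a diagonal Gaussian posterior and a standard-normal prior):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# stand-ins for encoder/decoder outputs on a batch of flattened images
x = torch.rand(8, 784)
x_recon = torch.rand(8, 784)
mu, logvar = torch.randn(8, 16), torch.randn(8, 16)

recon_term = F.mse_loss(x_recon, x, reduction='sum')
# KL( N(mu, sigma^2) || N(0, I) ) in closed form
kl_term = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

loss_full = recon_term + kl_term   # the intended VAE loss
loss_bug = recon_term              # the "forgot the KL term" variant from the question
```

Dropping `kl_term` only removes the regularization on the $x \rightarrow z$ transition; the reconstruction term itself is untouched.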
Answer:
A. True - That is the way we train the model: we can map the latent-space distribution to any distribution.
B. False - There is random sampling involved in this process, therefore it is not deterministic.
C. True - We optimize the KL-divergence term by optimizing the ELBO (evidence lower bound).
Answer:
A. False. We want both the discriminator and the generator to perform well and have low losses. Since they train one against the other, we want them to have a "fair competition". In other words, it is hard to fool a good discriminator (one with a low loss) into a wrong decision, and hence the generator will need to do a better job.
B. False. They are trained separately: when we train the discriminator, we use the generator's output as input, but we do not train the generator.
C. True. The generator maps a latent-space variable $u\sim \mathcal{N}(0, I)$ to an instance-space variable $x$, which is an image. This induces a parametric evidence distribution $p_\gamma(X)$, which we try to bring as close as possible to the real evidence distribution $p(X)$.
D. True. The discriminator helps the generator to train, so starting from a discriminator that has already trained for a few iterations gives the generator a better training signal.
E. False. If the discriminator has 50% accuracy, the generator will not be able to learn how to generate, because effectively the discriminator is "helping" the generator learn, so there is no use in training only the generator in this case.
Answer:
Answer:
In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.
You must choose one of the project options specified below.
project/ directory. You can import these files here, as we do for the homeworks.

Based on Tutorials 6 and 7, we'll implement and train an improved sentiment analysis model. We'll use self-attention instead of RNNs and incorporate pre-trained word embeddings.
In tutorial 6 we saw that we can train word embeddings together with the model.
Although this produces embeddings which are customized to the specific task at hand,
it also greatly increases training time.
A common technique is to use pre-trained word embeddings.
This is essentially a large mapping from words (e.g. in english) to some
high-dimensional vector, such that semantically similar words have an embedding that is
"close" by some metric (e.g. cosine distance).
Use the GloVe 6B embeddings for this purpose.
You can load these vectors into the weights of an nn.Embedding layer.
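A minimal sketch of this loading step, with a random matrix standing in for the actual GloVe 6B vectors (in practice each row would be parsed from the downloaded GloVe text files):

```python
import torch
import torch.nn as nn

# placeholder matrix standing in for the real pre-trained GloVe vectors
vocab_size, embed_dim = 1000, 50
glove_vectors = torch.randn(vocab_size, embed_dim)

# freeze=True keeps the pre-trained vectors fixed during training
emb = nn.Embedding.from_pretrained(glove_vectors, freeze=True)

token_ids = torch.tensor([1, 7, 42])   # word indices from some vocabulary mapping
out = emb(token_ids)
```

Setting `freeze=False` instead would fine-tune the vectors for the task, at the cost of the extra training time mentioned above.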
In tutorial 7 we learned how attention can be used to learn to predict a relative importance for each element in a sequence, compared to the other elements. Here, we'll replace the RNN with a self-attention-only approach similar to Transformer models, roughly based on this paper. After embedding each word in the sentence using the pre-trained word embeddings, a positional-encoding vector is added to give each word in the sentence a unique value based on its location. One or more self-attention layers are then applied to the result, to obtain an importance weighting for each word. Then we classify the sentence based on the average of these weighted encodings.
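The positional-encoding step described above can be sketched with the standard sinusoidal formulation (sizes are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding, as in 'Attention Is All You Need'."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = positional_encoding(max_len=100, d_model=64)
embedded = torch.randn(100, 64)          # stand-in for an embedded sentence
x = embedded + pe                        # each position gets a unique offset
```

Because each position receives a distinct, fixed pattern, the subsequent self-attention layers can distinguish word order even though attention itself is permutation-invariant.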
Now, using these approaches, you need to:
Your results should include:
In HW3 we implemented a simple GAN from scratch, using an approach very similar to the original GAN paper. However, the results left much to be desired, and we discovered first-hand how hard it is to train GANs due to their inherent instability.
One of the prevailing approaches for improving training stability for GANs is to use a technique called Spectral Normalization to normalize the largest singular value of a weight matrix so that it equals 1.
This approach is generally applied to the discriminator's weights in order to stabilize training. The resulting model is sometimes referred to as a SN-GAN.
See Appendix A in the linked paper for the exact algorithm. You can also use pytorch's spectral_norm.
Another very common improvement to the vanilla GAN is known as the Wasserstein GAN (WGAN). It uses a simple modification to the loss function, with strong theoretical justification based on the Wasserstein (earth-mover's) distance. See also here for a brief explanation of this loss function.
One problem with generative models for images is that it's difficult to objectively assess the quality of the resulting images. To also obtain a quantitative score for the images generated by each model, we'll use the Inception Score. This uses a pre-trained Inception CNN model on the generated images and computes a score based on the predicted probability for each class. Although not a perfect proxy for subjective quality, it's commonly used as a way to compare generative models. You can use an implementation of this score that you find online, e.g. this one, or implement it yourself.
Based on the linked papers, add Spectral Normalization and the Wasserstein loss to your GAN from HW3. Compare between:
As a dataset, you can use LFW as in HW3 or CelebA, or even choose a custom dataset (note that there's a dataloader for CelebA in torchvision).
Your results should include:
TODO: This is where you should write your explanations and implement the code to display the results. See guidelines about what to include in this section.
As can be understood from the title, we chose to implement the GAN option. We will use the same network we built in homework #3, and the same dataset. Based on that DCGAN, we will implement the SN-GAN, which uses spectral normalization, and the WGAN, which uses the Wasserstein loss and some more changes that will be detailed later. We will try to keep it as simple as possible, changing as little as possible between sections, so that we can isolate the effect of each added idea.
Note that the HTML file doesn't run any training, but uses the project/<GAN_name>_final.pt files. We removed all of those files, as asked, so running this notebook will perform the training process and then print the results.
Let's start with the imports.
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile
import pickle
import numpy as np
import torch
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda
As mentioned before, we'll use the Labeled Faces in the Wild (LFW) dataset as done in HW #3. Again, we're going to train our generative model to generate George W. Bush images (or, as you may call it, a Bush Generator 😎).
import cs236781.plot as plot
import cs236781.download
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
im_size = 64
tf = T.Compose([
# Resize to constant spatial dimensions
T.Resize((im_size, im_size)),
# PIL.Image -> torch.Tensor
T.ToTensor(),
# Dynamic range [0,1] -> [-1, 1]
T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])
ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
File /home/nitzanmadar/.pytorch-datasets/lfw-bush.zip exists, skipping download. Extracting /home/nitzanmadar/.pytorch-datasets/lfw-bush.zip... Extracted 531 to /home/nitzanmadar/.pytorch-datasets/lfw/George_W_Bush
Let's see what we've got.
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(f'Images shape:{x0.shape}')
test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
Found 530 images in dataset folder. Images shape:torch.Size([1, 3, 64, 64])
We remind again: in the next sections, the notebook will load the results if the relevant files exist in the project folder.
To train the networks again, just rename/remove the corresponding <GAN_name>_final.pt file.
As the vanilla GAN model, we will use the same architecture from HW#3, which is a simple DCGAN.
First, as a reminder, let's see the model architecture. This model will be the base of all the following networks.
import project.vanilla_gan as vanilla_gan
dsc = vanilla_gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)
d0 = dsc(x0)
print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
(Discriminator_net): Sequential(
(0): Conv2d(3, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.01)
(3): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(256, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.01)
(9): Conv2d(512, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(10): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): LeakyReLU(negative_slope=0.01)
)
(linear): Linear(in_features=16384, out_features=1, bias=True)
)
torch.Size([1, 1])
z_dim = 128
gen = vanilla_gan.Generator(z_dim, 4).to(device)
print(gen)
z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
(unlinear): Linear(in_features=128, out_features=16384, bias=True)
(Decoder): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.01)
(3): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.01)
(6): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.01)
(9): ConvTranspose2d(128, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(10): Tanh()
)
)
torch.Size([1, 3, 64, 64])
The next cell either trains the model or loads the final checkpoint file. This should reproduce the training and results from HW#3.
In addition, we have added some simple code to save the loss curves; this will also be used in the next sections to compare the different training loss curves.
import project.vanilla_gan as gan
import torch.optim as optim
from torch.utils.data import DataLoader
from project.hyperparameters import vanilla_gan_hyperparams
from project.inception_score import inception_score
torch.manual_seed(42)
# Hyperparams
hp = vanilla_gan_hyperparams()
skip = False
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
opt_params = opt_params.copy()
optimizer_type = opt_params['type']
opt_params.pop('type')
return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])
def gen_loss_fn(y_generated):
return gan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'project/vanilla_gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
import IPython.display
import tqdm
from project.vanilla_gan import train_batch, save_checkpoint
num_epochs = 100
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
num_epochs = 0
gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
checkpoint_file = checkpoint_file_final
skip=True
try:
dsc_avg_losses, gen_avg_losses = [], []
vanilla_inception_mean, vanilla_inception_std = [], []
for epoch_idx in range(num_epochs):
# We'll accumulate batch losses and show an average once per epoch.
dsc_losses, gen_losses = [], []
print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
for batch_idx, (x_data, _) in enumerate(dl_train):
x_data = x_data.to(device)
dsc_loss, gen_loss = train_batch(
dsc, gen,
dsc_loss_fn, gen_loss_fn,
dsc_optimizer, gen_optimizer,
x_data)
dsc_losses.append(dsc_loss)
gen_losses.append(gen_loss)
pbar.update()
dsc_avg_losses.append(np.mean(dsc_losses))
gen_avg_losses.append(np.mean(gen_losses))
print(f'Discriminator loss: {dsc_avg_losses[-1]}')
print(f'Generator loss: {gen_avg_losses[-1]}')
if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
print(f'Saved checkpoint.')
samples = gen.sample(5, with_grad=False)
mean, std= inception_score(gen.sample(n=80, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print(f'inception score: {mean}')
vanilla_inception_mean.append(mean)
vanilla_inception_std.append(std)
fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
if not skip:
# Save losses:
import pickle
# Save discriminator and generator losses
with open('project/vanilla_avg_dsc_losses.pkl', 'wb') as f:
pickle.dump(dsc_avg_losses, f)
print('Saved file: project/vanilla_avg_dsc_losses.pkl')
with open('project/vanilla_avg_gen_losses.pkl', 'wb') as f:
pickle.dump(gen_avg_losses, f)
print('Saved file: project/vanilla_avg_gen_losses.pkl')
with open('project/vanilla_mean_inception.pkl', 'wb') as f:
pickle.dump(vanilla_inception_mean, f)
print('Saved file: project/vanilla_mean_inception.pkl')
with open('project/vanilla_std_inception.pkl', 'wb') as f:
pickle.dump(vanilla_inception_std, f)
print('Saved file: project/vanilla_std_inception.pkl')
except KeyboardInterrupt as e:
print('\n *** Training interrupted by user')
{'batch_size': 32, 'z_dim': 128, 'data_label': 1, 'label_noise': 0.25, 'discriminator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}, 'generator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}}
*** Loading final checkpoint file project/vanilla_gan_final instead of training
The training process was described in HW#3; here we just load the model and the losses we got during training.
# read discriminator and generator losses
with open('project/vanilla_avg_dsc_losses.pkl', 'rb') as f:
vanilla_dsc_avg_losses = pickle.load(f)
with open('project/vanilla_avg_gen_losses.pkl', 'rb') as f:
vanilla_gen_avg_losses = pickle.load(f)
plt.plot(vanilla_gen_avg_losses, label='Vanilla-GAN Generator')
plt.plot(vanilla_dsc_avg_losses, label='Vanilla-GAN Discriminator')
plt.legend()
plt.title('Vanilla Losses')
plt.show()
Here we can see that the generator loss improves sharply at the beginning, and towards the end the loss is a bit noisy. The discriminator loss looks roughly constant, which can be interpreted as a discriminator that stays stable relative to the generator's capabilities. In other words, the results are not bad: the generator loss shows improvement, but it is also noisy, which leaves room for improvement.
Now, we will load the model (~74MB, will not be added to the submission file; the same one submitted in HW#3) and display generated images:
import IPython.display
import tqdm
from project.vanilla_gan import train_batch, save_checkpoint
from project.hyperparameters import vanilla_gan_hyperparams
hp = vanilla_gan_hyperparams()
vanilla_checkpoint_file_final = f'project/vanilla_gan_final.pt'
print(f'*** Loading final checkpoint file {vanilla_checkpoint_file_final} ***')
print(f'Hyperparameters: {hp}')
vanilla_gen = torch.load(vanilla_checkpoint_file_final, map_location=device)
print('*** Images Generated from best model:')
vanilla_samples = vanilla_gen.sample(n=24, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(vanilla_samples, nrows=3, figsize=(25,12))
*** Loading final checkpoint file project/vanilla_gan_final.pt ***
Hyperparameters: {'batch_size': 32, 'z_dim': 128, 'data_label': 1, 'label_noise': 0.25, 'discriminator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}, 'generator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}}
*** Images Generated from best model:
Spectral Normalization is a weight-normalization method that aims to deal with the instability of the discriminator during GAN training. What it actually does is rescale the weight tensor by its spectral norm, i.e. normalize each weight matrix by its largest singular value. It can be formalized mathematically: $$W_{SN} = \frac{W}{\sigma(W)}$$ where $\sigma(W)$ is the maximal singular value: $$\sigma(W) = \max_{h:h\neq0}\frac{\lVert Wh\rVert_2}{\lVert h\rVert_2}$$
One of the challenges of this method is finding $\sigma(W)$. This issue is solved by computing it using the power-iteration technique, which is cheap and effective. Let's dive a little into it and provide a short mathematical explanation of this method:
Let's think of a linear transform $W : \mathbb{R}^n \rightarrow \mathbb{R}^m$, and suppose we have a vector $v\in \mathbb{R}^n$ and a vector $u\in\mathbb{R}^m$.
We can also form the square matrix $W^T W : \mathbb{R}^n \rightarrow \mathbb{R}^n$. A power iteration, $$v_{t+1} = \frac{W^T W v_t}{\lVert W^T W v_t\rVert},$$ can be unrolled to write: $$v_{t} = \frac{(W^T W)^t v}{\lVert (W^T W)^t v\rVert}$$ According to the spectral theorem, we can write $v$ in an orthonormal basis of eigenvectors of $W^T W$. Denote by $\left(\lambda_1, \ldots, \lambda_n\right)$ the descending eigenvalues of $W^TW$ and by $\left(e_1, \ldots, e_n\right)$ the corresponding eigenvectors.
The power iteration can then be computed as follows: $$v_{t} = \frac{(W^T W)^t \sum_i v_i e_i}{\lVert (W^T W)^t \sum_i v_i e_i\rVert} = \frac{\sum_i v_i \lambda_i^t e_i}{\lVert \sum_i v_i \lambda_i^t e_i\rVert} = \frac{v_1 \lambda_1^t \sum_i \frac{v_i}{v_1} \left(\frac{\lambda_i}{\lambda_1}\right)^t e_i}{\lVert v_1 \lambda_1^t \sum_i \frac{v_i}{v_1} \left(\frac{\lambda_i}{\lambda_1}\right)^t e_i\rVert}$$
As mentioned before, $\lambda_1$ is the largest eigenvalue of $W^TW$, so under power iteration, for $i>1$: $\lim\limits_{t\rightarrow\infty}\left(\frac{\lambda_i}{\lambda_1}\right)^t = 0$. Therefore, $v_t$ converges to $e_1$. Additionally, defining $u_t = \frac{W v_t}{\lVert W v_t\rVert}$, the power iteration can be written as: $$ u_{t+1} = \frac{W v_t}{\lVert W v_t\rVert},\qquad v_{t+1} = \frac{W^T u_{t+1}}{\lVert W^T u_{t+1}\rVert} $$
The singular values of $W^T$ and $W$ are the same, thus the spectral norm is $\sigma(W) = \sqrt{\lambda_1} = \lVert W v\rVert$. Since $u$ is of unit length, the spectral norm can be computed as follows: $$\sigma(W) = \lVert W v\rVert = u^T W v$$
Now the spectral-normalization algorithm should appear simple: for every weight matrix in our network, we randomly initialize vectors $u$ and $v$. Because the weights change slowly during training, we only need to perform a single power iteration on the current version of these vectors at each learning step.
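The derivation can be sanity-checked numerically; the sketch below runs the power iteration to convergence on a random matrix and compares $u^T W v$ against the largest singular value from an SVD:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 32)
u = torch.randn(64)
v = torch.randn(32)

for _ in range(300):          # in SN-GAN a single iteration per training step
    v = W.t() @ u             # suffices, since the weights change slowly;
    v = v / v.norm()          # here we iterate to convergence for the check
    u = W @ v
    u = u / u.norm()

sigma = u @ W @ v             # estimated spectral norm: u^T W v
true_sigma = torch.linalg.svdvals(W)[0]
print(sigma.item(), true_sigma.item())
```

Compared to a full SVD, each power-iteration step costs only two matrix-vector products, which is what makes the per-step normalization cheap.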
We used PyTorch's nn.utils.spectral_norm() wrapper, which does exactly what was described above.
In order to make a fair comparison and see the effect of changing the normalization, we will use the same network as in the Vanilla GAN section, except that each nn.Conv2d → nn.BatchNorm2d pair in the discriminator is replaced by nn.utils.spectral_norm(nn.Conv2d(...)).
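A minimal usage sketch of the wrapper (a small conv layer for illustration; the weight is re-normalized via one power iteration on every forward pass):

```python
import torch
import torch.nn as nn

conv = nn.utils.spectral_norm(nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2))
x = torch.randn(1, 3, 64, 64)
y = conv(x)                    # forward pass updates the power-iteration vectors

# the wrapper stores the unnormalized weight separately as `weight_orig`
print(hasattr(conv, 'weight_orig'))
print(y.shape)
```

The module still behaves like a regular conv layer; only its effective weight is divided by the running spectral-norm estimate.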
Let's print the architecture:
import project.SN_gan as SN_gan
dsc_SN = SN_gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc_SN)
d0 = dsc_SN(x0)
print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
z_dim = 128
gen_SN = SN_gan.Generator(z_dim, 4).to(device)
print(gen_SN)
z = torch.randn(1, z_dim).to(device)
xr = gen_SN(z)
print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
Discriminator(
(Discriminator_net): Sequential(
(0): Conv2d(3, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(1): LeakyReLU(negative_slope=0.01)
(2): Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(256, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(512, 1024, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
(7): LeakyReLU(negative_slope=0.01)
)
(linear): Linear(in_features=16384, out_features=1, bias=True)
)
torch.Size([1, 1])
Generator(
(unlinear): Linear(in_features=128, out_features=16384, bias=True)
(Decoder): Sequential(
(0): ConvTranspose2d(1024, 512, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): LeakyReLU(negative_slope=0.01)
(3): ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): LeakyReLU(negative_slope=0.01)
(6): ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): LeakyReLU(negative_slope=0.01)
(9): ConvTranspose2d(128, 3, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), output_padding=(1, 1))
(10): Tanh()
)
)
torch.Size([1, 3, 64, 64])
As we have seen in various implementations of SN-GAN, the spectral normalization is added to the discriminator network. Accordingly, we removed the batch-normalization layers in the discriminator and changed it to use spectral normalization instead. Note that the spectral-normalization wrapper isn't visible in the architecture printout above!
import torch.optim as optim
from torch.utils.data import DataLoader
from project.hyperparameters import sn_gan_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = sn_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc_sn = SN_gan.Discriminator(im_size).to(device)
gen_sn = SN_gan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
opt_params = opt_params.copy()
optimizer_type = opt_params['type']
opt_params.pop('type')
return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc_sn.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen_sn.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
return SN_gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])
def gen_loss_fn(y_generated):
return SN_gan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'project/sn_gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
{'batch_size': 32, 'z_dim': 128, 'data_label': 1, 'label_noise': 0.25, 'discriminator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}, 'generator_optimizer': {'type': 'Adam', 'weight_decay': 0.02, 'betas': (0.5, 0.99), 'lr': 0.0002}}
import IPython.display
import tqdm
from project.SN_gan import train_batch, save_checkpoint
num_epochs = 100
skip=False
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
num_epochs = 0
gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
checkpoint_file = checkpoint_file_final
skip=True
try:
dsc_avg_losses, gen_avg_losses = [], []
sn_inception_mean, sn_inception_std = [], []
for epoch_idx in range(num_epochs):
# We'll accumulate batch losses and show an average once per epoch.
dsc_losses, gen_losses = [], []
print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
for batch_idx, (x_data, _) in enumerate(dl_train):
x_data = x_data.to(device)
dsc_loss, gen_loss = train_batch(
dsc_sn, gen_sn,
dsc_loss_fn, gen_loss_fn,
dsc_optimizer, gen_optimizer,
x_data)
dsc_losses.append(dsc_loss)
gen_losses.append(gen_loss)
pbar.update()
dsc_avg_losses.append(np.mean(dsc_losses))
gen_avg_losses.append(np.mean(gen_losses))
print(f'Discriminator loss: {dsc_avg_losses[-1]}')
print(f'Generator loss: {gen_avg_losses[-1]}')
if save_checkpoint(gen_sn, dsc_avg_losses, gen_avg_losses, checkpoint_file):
print(f'Saved checkpoint.')
samples = gen_sn.sample(5, with_grad=False)
mean, std = inception_score(gen_sn.sample(n=80, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print(f'inception score: {mean}')
sn_inception_mean.append(mean)
sn_inception_std.append(std)
fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
except KeyboardInterrupt as e:
print('\n *** Training interrupted by user')
*** Loading final checkpoint file project/sn_gan_final instead of training
if not skip:
# Save discriminator and generator losses
with open('project/sn_avg_dsc_losses.pkl', 'wb') as f:
pickle.dump(dsc_avg_losses, f)
print('Saved file: project/sn_avg_dsc_losses.pkl')
with open('project/sn_avg_gen_losses.pkl', 'wb') as f:
pickle.dump(gen_avg_losses, f)
print('Saved file: project/sn_avg_gen_losses.pkl')
with open('project/sn_inception_mean.pkl', 'wb') as f:
pickle.dump(sn_inception_mean, f)
print('Saved file: project/sn_inception_mean.pkl')
with open('project/sn_inception_std.pkl', 'wb') as f:
pickle.dump(sn_inception_std, f)
print('Saved file: project/sn_inception_std.pkl')
# read discriminator and generator losses
with open('project/sn_avg_dsc_losses.pkl', 'rb') as f:
sn_dsc_avg_losses = pickle.load(f)
with open('project/sn_avg_gen_losses.pkl', 'rb') as f:
sn_gen_avg_losses = pickle.load(f)
plt.plot(sn_gen_avg_losses, label='SN-GAN Generator')
plt.plot(sn_dsc_avg_losses, label='SN-GAN Discriminator')
plt.legend()
plt.title('SN-GAN Losses')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
The figure above differs from the vanilla GAN in that the SN-GAN generator quickly improves to a lower loss than the discriminator (around epoch 5), whereas in the vanilla GAN the generator's loss stays above the discriminator's the whole time. We can see that later the discriminator loss starts to improve and the curves intersect again (around epoch 40). This doesn't mean the generator became worse; it means that both improve, but the discriminator is doing a better job (we can see similar behavior in the vanilla GAN, although it is noisier). Accordingly, we chose the last-epoch model and show its results here. Additionally, compared to the vanilla GAN generator loss (the comparison is meaningful because the networks and loss terms are similar), we got lower loss values here, which is a good indicator of better results, as we will show in the next cell.
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
sn_gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print(checkpoint_file)
print('*** Images Generated from best model:')
samples = sn_gen.sample(n=24, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(25,12))
project/sn_gan_final
*** Images Generated from best model:
As we can see, this simple modification improved our generator's performance. The images generated by the SN-GAN are cleaner and smoother, with better facial details. As a reminder, we changed nothing except replacing batch normalization with spectral normalization.
In this section, we will use the vanilla GAN provided in the first section and try to improve it using the principles shown in the Wasserstein GAN paper.
From this paper we learn and implement the following points:
Wasserstein loss: we change the loss function as described in the paper. There is no cross-entropy or any $\log$ term in the Wasserstein losses; instead we use the following terms:
Discriminator (critic) loss: $\mathbb{E}_{x \sim p_{\text{data}}}\big[ \Delta_\delta(x) \big] - \mathbb{E}_{z \sim p(z)}\big[ \Delta_\delta(g_\theta(z)) \big]$.
Generator loss: here we simply use $D(G(z))$, or $ \mathbb{E}_{z \sim p(z)}\big[ \Delta_\delta(g_\theta(z)) \big]$ in other words.
Note that our optimizer aims to minimize the loss function; therefore, in the code, we return the negative of the losses above.
Additionally, the code works with class-score vectors, so we return torch.mean() of the negated values, as mentioned above.
The output of the discriminator is no longer a probability, and accordingly we do not apply a sigmoid at its output.
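The loss terms described above can be sketched as follows (a minimal illustration; the function names are ours, and the actual implementations live in `project/wgan.py`):

```python
import torch

def wgan_discriminator_loss(y_data, y_generated):
    # Critic maximizes E[D(x)] - E[D(G(z))]; we return the negation
    # so a standard optimizer can minimize it.
    return -(torch.mean(y_data) - torch.mean(y_generated))

def wgan_generator_loss(y_generated):
    # Generator maximizes E[D(G(z))]; again negated for minimization.
    return -torch.mean(y_generated)

# Toy critic scores (raw, unbounded -- no sigmoid):
y_real = torch.tensor([0.9, 0.8])
y_fake = torch.tensor([0.1, 0.3])
print(wgan_discriminator_loss(y_real, y_fake))  # -(0.85 - 0.2) = -0.65
print(wgan_generator_loss(y_fake))              # -0.2
```

Note there is no `log` and no sigmoid anywhere: the critic outputs raw scores, and both losses are plain means of those scores.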
Clip discriminator weights: the paper suggests clipping the discriminator's weights; hence, in the discriminator part of the train_batch function, we call param.data.clamp_(-c, c) for each parameter of the discriminator. As in the paper's algorithm, we use the same clipping value $c=0.01$.
Train the discriminator more than the generator: to do that, we simply added an inner loop in the train_batch function that performs 2 cycles of discriminator training per batch (thus the discriminator trains twice as often as the generator).
Optimizer: the paper uses the RMSProp optimizer, so we use it as well, instead of Adam as in the previous networks.
Lower learning rate: the paper uses a learning rate of $0.00005$; we used the same value.
Another change we made is using a lower z_dim, once we realized this is the easiest change that yields better results (without it, we got worse results than the vanilla GAN we are trying to improve).
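The training-step changes above (the inner critic loop, weight clipping, and negated losses) can be sketched together as follows; this is an illustrative toy version with linear "networks", not the project's actual train_batch:

```python
import torch

def wgan_train_batch(dsc, gen, dsc_opt, gen_opt, x_data,
                     z_dim=10, n_critic=2, clip_value=0.01):
    # Illustrative sketch of the WGAN update scheme described above.
    for _ in range(n_critic):
        dsc_opt.zero_grad()
        z = torch.randn(x_data.shape[0], z_dim)
        x_fake = gen(z).detach()  # no generator gradients in the critic step
        # Negated Wasserstein critic loss, so the optimizer minimizes it
        dsc_loss = -(dsc(x_data).mean() - dsc(x_fake).mean())
        dsc_loss.backward()
        dsc_opt.step()
        # Weight clipping: a crude way to keep the critic roughly Lipschitz
        for p in dsc.parameters():
            p.data.clamp_(-clip_value, clip_value)
    gen_opt.zero_grad()
    z = torch.randn(x_data.shape[0], z_dim)
    gen_loss = -dsc(gen(z)).mean()  # negated generator objective
    gen_loss.backward()
    gen_opt.step()
    return dsc_loss.item(), gen_loss.item()

# Toy linear models just to exercise the function
dsc = torch.nn.Linear(8, 1)
gen = torch.nn.Linear(10, 8)
dsc_opt = torch.optim.RMSprop(dsc.parameters(), lr=5e-5)
gen_opt = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
d_loss, g_loss = wgan_train_batch(dsc, gen, dsc_opt, gen_opt, torch.randn(4, 8))
print(max(p.data.abs().max().item() for p in dsc.parameters()) <= 0.01)  # True
```

After every critic step, all critic parameters are clamped into $[-c, c]$ with $c = 0.01$, exactly as in the paper's algorithm.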
Using these key insights from the paper, we tweak our vanilla GAN and train it to see the improvements:
import project.wgan as wgan
dsc_wgan = wgan.Discriminator(in_size=x0[0].shape).to(device)
# print(dsc_wgan) #same net
d0 = dsc_wgan(x0)
# print(d0.shape)
test.assertSequenceEqual(d0.shape, (1,1))
z_dim = 128
gen_wgan = wgan.Generator(z_dim, 4).to(device)
# print(gen_wgan) #same net
z = torch.randn(1, z_dim).to(device)
xr = gen_wgan(z)
# print(xr.shape)
test.assertSequenceEqual(x0.shape, xr.shape)
import torch.optim as optim
from torch.utils.data import DataLoader
from project.hyperparameters import wgan_hyperparams
torch.manual_seed(42)
# Hyperparams
hp = wgan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']
# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape
# Model
dsc_wgan = wgan.Discriminator(im_size).to(device)
gen_wgan = wgan.Generator(z_dim, featuremap_size=4).to(device)
# Optimizer
def create_optimizer(model_params, opt_params):
opt_params = opt_params.copy()
optimizer_type = opt_params['type']
opt_params.pop('type')
return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc_wgan.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen_wgan.parameters(), hp['generator_optimizer'])
# Loss
def dsc_loss_fn(y_data, y_generated):
return wgan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])
def gen_loss_fn(y_generated):
return wgan.generator_loss_fn(y_generated, hp['data_label'])
# Training
checkpoint_file = 'project/wgan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
os.remove(f'{checkpoint_file}.pt')
# Show hypers
print(hp)
{'batch_size': 32, 'z_dim': 10, 'data_label': 1, 'label_noise': 0.25, 'discriminator_optimizer': {'type': 'RMSprop', 'lr': 5e-05}, 'generator_optimizer': {'type': 'RMSprop', 'lr': 5e-05}}
import IPython.display
import tqdm
from project.wgan import train_batch, save_checkpoint
num_epochs = 100
skip=False
if os.path.isfile(f'{checkpoint_file_final}.pt'):
print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
num_epochs = 0
wgan_gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device)
checkpoint_file = checkpoint_file_final
skip=True
try:
dsc_avg_losses, gen_avg_losses = [], []
wgan_inception_mean, wgan_inception_std = [], []
for epoch_idx in range(num_epochs):
# We'll accumulate batch losses and show an average once per epoch.
dsc_losses, gen_losses = [], []
print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')
with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
for batch_idx, (x_data, _) in enumerate(dl_train):
x_data = x_data.to(device)
dsc_loss, gen_loss = train_batch(
dsc_wgan, gen_wgan,
dsc_loss_fn, gen_loss_fn,
dsc_optimizer, gen_optimizer,
x_data)
dsc_losses.append(dsc_loss)
gen_losses.append(gen_loss)
pbar.update()
dsc_avg_losses.append(np.mean(dsc_losses))
gen_avg_losses.append(np.mean(gen_losses))
print(f'Discriminator loss: {dsc_avg_losses[-1]}')
print(f'Generator loss: {gen_avg_losses[-1]}')
if save_checkpoint(gen_wgan, dsc_avg_losses, gen_avg_losses, checkpoint_file):
print(f'Saved checkpoint.')
samples = gen_wgan.sample(5, with_grad=False)
mean, std = inception_score(gen_wgan.sample(n=80, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print(f'inception score: {mean}')
wgan_inception_mean.append(mean)
wgan_inception_std.append(std)
fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
IPython.display.display(fig)
plt.close(fig)
except KeyboardInterrupt as e:
print('\n *** Training interrupted by user')
*** Loading final checkpoint file project/wgan_final instead of training
if not skip:
# Save discriminator and generator losses
with open('project/wgan_avg_dsc_losses.pkl', 'wb') as f:
pickle.dump(dsc_avg_losses, f)
print('Saved file: project/wgan_avg_dsc_losses.pkl')
with open('project/wgan_avg_gen_losses.pkl', 'wb') as f:
pickle.dump(gen_avg_losses, f)
print('Saved file: project/wgan_avg_gen_losses.pkl')
with open('project/wgan_inception_mean.pkl', 'wb') as f:
pickle.dump(wgan_inception_mean, f)
print('Saved file: project/wgan_inception_mean.pkl')
with open('project/wgan_inception_std.pkl', 'wb') as f:
pickle.dump(wgan_inception_std, f)
print('Saved file: project/wgan_inception_std.pkl')
Saved file: project/wgan_avg_dsc_losses.pkl Saved file: project/wgan_avg_gen_losses.pkl Saved file: project/wgan_inception_mean.pkl Saved file: project/wgan_inception_std.pkl
# read discriminator and generator losses
with open('project/wgan_avg_dsc_losses.pkl', 'rb') as f:
vanilla_dsc_avg_losses = pickle.load(f)
with open('project/wgan_avg_gen_losses.pkl', 'rb') as f:
vanilla_gen_avg_losses = pickle.load(f)
plt.plot(vanilla_gen_avg_losses, label='WGAN Generator')
plt.plot(vanilla_dsc_avg_losses, label='WGAN Discriminator')
plt.legend()
plt.title('WGAN Losses')
plt.show()
Here, the loss curves no longer represent cross-entropy losses and are therefore not comparable to the values in the curves shown before. What really matters here are the following two points:
First, we train the discriminator twice as much as the generator, which can explain why the discriminator loss decreases faster. The generator loss going up is not necessarily bad behavior: in this case the discriminator improves faster and probably has better capabilities than before, so it is harder for the generator to fool it, and that is reflected in the loss curves. Although the loss increases, the generated samples visibly improve during training.
Second, we can see that the losses converge to some limit, which suggests the networks reach a local minimum. Convergence of both networks can sometimes indicate problems, but it can also indicate that the framework converged to a local minimum that will later be reflected in good results.
# Plot images from best or last model
# sn_checkpoint_file_final = f'project/wgan_final.pt'
if os.path.isfile(f'{checkpoint_file}.pt'):
wgan_gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = wgan_gen.sample(n=24, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(25,12))
*** Images Generated from best model:
As can be seen, here too we get better generated images from the generator compared to the vanilla GAN. Visually, the quality looks a bit better than the SN-GAN's, since the variety of the images is larger while the facial features look comparable. Note that here we changed more than one aspect of the network, as described in the introduction to this section, so the comparison is not straightforward (different hyperparameters, optimizer, unequal training between the discriminator and the generator, weight clipping, and different loss functions).
The Inception Score (Tim Salimans, et al., 2016) is a measure for evaluating generated synthetic images, specifically from GAN models.
It is calculated using the following formula:
$$ IS = \exp\left( \mathbb{E}_{x\sim p_g} D_{KL}\big(p(y|x) \,\rVert\, p(y)\big) \right) $$
where $x\sim p_g$ indicates that $x$ is an image sampled from $p_g$, $D_{KL}(p \rVert q)$ is the KL-divergence between the distributions $p$ and $q$, $p(y|x)$ is the conditional class distribution, and $p(y)$ is the marginal class distribution.
The inception score tries to score generated images by estimating two quantities: image quality (a confident, low-entropy conditional distribution $p(y|x)$) and image diversity (a high-entropy marginal distribution $p(y)$).
This metric was also shown to correlate well with human scoring of the realism of generated images from the CIFAR-10 dataset.
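As a sanity check of the formula, here is a small sketch (the function name is ours) that computes the score directly from a matrix of class probabilities $p(y|x)$; in the real metric these probabilities come from a pretrained Inception network:

```python
import numpy as np

def inception_score_from_probs(pyx, eps=1e-12):
    """Compute IS from an (N, num_classes) matrix of class probabilities p(y|x)."""
    py = pyx.mean(axis=0)  # marginal p(y), averaged over samples
    # Per-sample KL(p(y|x) || p(y)), then exponentiate the mean
    kl = (pyx * (np.log(pyx + eps) - np.log(py + eps))).sum(axis=1)
    return np.exp(kl.mean())

# Confident AND diverse predictions -> score equals the number of classes
sharp = np.eye(4)
# Uniform (unconfident) predictions -> KL is zero, score is 1
flat = np.full((4, 4), 0.25)
print(round(inception_score_from_probs(sharp), 2))  # 4.0
print(round(inception_score_from_probs(flat), 2))   # 1.0
```

The two extremes illustrate the two quantities the score combines: `sharp` maximizes both confidence and diversity, while `flat` has neither.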
We will use the code provided here (project/inception_score.py) to calculate the inception score of our models.
from project.inception_score import inception_score
# (mean, std)
num_of_samples = 320
print ("Final Vanilla GAN Inception Score:")
vanilla_inception_mean, vanilla_inception_std = inception_score(vanilla_gen.sample(n=num_of_samples, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print (f' mean = {round(vanilla_inception_mean,2)}, std = {round(vanilla_inception_std,2)} \n')
print ("Final SN-GAN Inception Score:")
sn_inception_mean, sn_inception_std = inception_score(sn_gen.sample(n=num_of_samples, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print (f' mean = {round(sn_inception_mean, 2)}, std = {round(sn_inception_std,2)} \n')
print ("Final WGAN Inception Score:")
wgan_inception_mean, wgan_inception_std = inception_score(wgan_gen.sample(n=num_of_samples, with_grad=False).to(device), cuda=True, batch_size=16, resize=True, splits=10)
print (f' mean = {round(wgan_inception_mean, 2)}, std = {round(wgan_inception_std,2)} \n')
with open('project/vanilla_mean_inception.pkl', 'rb') as f:
vanilla_mean_inception = pickle.load(f)
with open('project/vanilla_std_inception.pkl', 'rb') as f:
vanilla_std_inception = pickle.load(f)
with open('project/sn_inception_mean.pkl', 'rb') as f:
sn_mean_inception = pickle.load(f)
with open('project/sn_inception_std.pkl', 'rb') as f:
sn_std_inception = pickle.load(f)
with open('project/wgan_inception_mean.pkl', 'rb') as f:
wgan_mean_inception = pickle.load(f)
with open('project/wgan_inception_std.pkl', 'rb') as f:
wgan_std_inception = pickle.load(f)
plt.plot(vanilla_mean_inception, label='Vanilla GAN')
plt.plot(sn_mean_inception, label='SN-GAN')
plt.plot(wgan_mean_inception, label='WGAN')
plt.legend()
plt.title('Inception Mean')
plt.show()
plt.plot(vanilla_std_inception, label='Vanilla GAN')
plt.plot(sn_std_inception, label='SN-GAN')
plt.plot(wgan_std_inception, label='WGAN')
plt.legend()
plt.title('Inception STD')
plt.show()
Final Vanilla GAN Inception Score:
mean = 2.11, std = 0.14
Final SN-GAN Inception Score:
mean = 2.51, std = 0.29
Final WGAN Inception Score:
mean = 2.44, std = 0.17
As we can see (and our own eyes are a good criterion, since the inception score correlates highly with human scoring, as mentioned before), our new models, SN-GAN and WGAN, are better, sharper, and more diverse. The inception score supports this conclusion, improving by roughly $15\%$.
Additionally, as noted before, the WGAN has larger diversity (std), while the quality of both methods (mean) looks quite similar.
As we saw above, the inception score is an easy and informative measure of generated data, especially for images, but using it has some disadvantages, as discussed in the paper "A Note on the Inception Score" (Barratt and Sharma, 2018).
The issues it points out are:
Suboptimalities of the Inception Score Itself
1.1. Sensitivity to weights: the paper shows that the inception score is sensitive to small changes in network weights that do not affect the final results. A better metric would be affected only by the results, not the weights. This can be relevant to us, because we clip the weights in WGAN but not in the other networks, and that can affect the score.
1.2. Score calculation and exponentiation: one parameter of the inception score is $n_{splits}$, which in practice reduces the number of samples from $N$ to $\frac{N}{n_{splits}}$. Typically $n_{splits}=10$ is chosen, which makes it harder to get enough samples for good statistics w.r.t. the number of classes. In our example this does not matter much, because we let the generator work on a single class.
Problems with popular usage of the inception score
2.1. Usage beyond the ImageNet dataset: the inception score works well on the ImageNet dataset and can work well on CIFAR-10, but not necessarily elsewhere. A different dataset can have a different class distribution ($p(y|x)$), a different number of classes, etc. Therefore, we have no guarantees about the generalizability of this measure.
2.2. Optimizing the inception score (indirectly & implicitly): this measure should be a rough guide for detecting good images. Trying to improve the score, even indirectly, can produce adversarial examples that score highly even though they are not good at all.
2.3. Not reporting overfitting: some algorithms can memorize training instances, meaning they overfit, yet perform extremely well in terms of the inception score.
It seems that in our example the score is a relatively good measure, but between the WGAN and the SN-GAN it is not a good criterion for selecting which one is better. Still, we see that our modifications help the models achieve better performance.